MacOS may lose data on APFS-formatted disk images (bombich.com)
396 points by mpweiher 5 months ago | 180 comments

Yes, fairly limited impact, but still. Data loss is fucking important. We aren't talking about an occasional stutter when moving windows or a sound driver that sometimes needs a reboot to resume working. This is a major bug. I would really really like it if Apple would get their shit together so that people who actually rely on their computers to work correctly can upgrade at some point.

As a counterpoint, APFS was deployed to almost a billion devices over the span of several months when it was released and, IIRC, this is the first major issue.

How do you know that? Filesystem corruption is frequently silent, and every time it happens customers don't get on the phone and send the disks to Apple so that they can root cause the problem. It's quite possible this bug has happened an untold number of times before it happened to someone who went through the effort to reproduce and isolate it.

Also, I'd guess the use pattern on iOS is rather different, and more homogeneous, from the use pattern on macOS. I don't think those millions of devices really give Apple good code coverage.

>How do you know that?

Can you point to any other APFS issues that were reported before this one?

My wife's laptop suddenly decided that the boot drive was corrupt a few weeks after she updated to APFS. None of the recovery tools were of any use. We had to reinstall the OS and pull the files from a backup. This story did not make it to Hacker News.

I thought Diskwarrior supported APFS - did you try that one?

Anecdata != filesystem issue.

If you have more information about the problem you encountered and how it implicates/interacts with APFS, please do link to it. Otherwise, bug reports via circumstantial evidence are, while not inherently false, certainly suspect.

You missed the point. The anecdote was to illustrate how even power users might be working around filesystem bugs so a lack of bug reports specifically mentioning APFS is certainly not proof that there aren't problems.

It's also quite hard to report fs issues. I ended up one day with a non-working APFS system. Boot was OK, but I couldn't mount the user partition. The APFS repair tool just failed and made the system hang. After a number of restarts, attempts at repair, and attempts to move the partition somewhere it could be decrypted, everything started working. And I actually had enough experience to try to debug/fix it - many people would end up wiping the system, or having to go to an Apple shop.

This is not reportable. I got only a generic error or a hanging system. I can't reproduce it. I don't know why it started and why it stopped. Yet it was almost certainly an APFS issue.

Even if I wanted to play, my priority was to get the work laptop usable again.

That's reasonable. I assumed that it was an assertion of evidence of an issue, not an example of how issues might theoretically go unnoticed. Upon rereading it does not appear that either is implied.

I'm not sure they're implying that it -was- an APFS issue, just that in the majority of circumstances users won't go through the same level of effort to diagnose an issue as in the article. Instead of pulling drives and trying to reproduce the error, they just wipe the drive and start fresh.

I could be wrong, but I believe the point is not that it did happen, but that this -could- have happened many times in the past and users just format/re-install without thinking about it.

The hardware is likely being blamed for a lot of failures that are software related. It's almost assured there is a software problem in cases where reinstalling the machine fixes the problem. A random machine which won't boot due to disk/filesystem failures could be a hardware issue, but that is pretty much ruled out if reinstalling/reformatting doesn't immediately manifest in further failures. Bit rot, stuck bits, and bad links are a thing, but they generally show up as massive soft error correction long before it reaches the point of simply being unable to read the sector, and when that happens the OS will almost always tell you that the sector can't be read rather than giving you garbage data.

That is because the likelihood of undetected hardware failures, given the layers and layers of ECC on the disks, links, etc., manifesting as filesystem metadata failures rather than garbage in the middle of video/image/document streams is really low. The more likely case is the machine's performance degrading due to read retries/ECC correction/retransmission, making the machine appear to have severe performance issues long before it manifests as silent data corruption sufficient to eat the filesystem structure (it's a fun exercise to intentionally flip a few random bits on a hard-drive image, or in RAM, and see if/when they are detected).
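That bit-flipping exercise is easy to simulate. Here's a rough Python sketch (a temp file stands in for a raw disk image; nothing here is APFS-specific): a plain read returns the corrupted bytes with no error at all, while a stored end-to-end digest catches the flip immediately.

```python
import hashlib
import os
import random
import tempfile

# Create a stand-in "disk image" and remember its checksum.
image = os.path.join(tempfile.mkdtemp(), "disk.img")
with open(image, "wb") as f:
    f.write(os.urandom(64 * 1024))
with open(image, "rb") as f:
    original_digest = hashlib.sha256(f.read()).hexdigest()

# Flip one random bit, as silent hardware corruption might.
with open(image, "r+b") as f:
    offset = random.randrange(64 * 1024)
    f.seek(offset)
    byte = f.read(1)[0]
    f.seek(offset)
    f.write(bytes([byte ^ (1 << random.randrange(8))]))

# A plain read succeeds with no error -- the corruption is silent...
with open(image, "rb") as f:
    data = f.read()

# ...but an end-to-end checksum detects it immediately.
corrupted = hashlib.sha256(data).hexdigest() != original_digest
print("corruption detected:", corrupted)  # corruption detected: True
```

The point being: nothing in the plain read path complains, which is exactly why corruption that dodges the drive's own ECC goes unnoticed without filesystem-level checks.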

So, yes the first thing I think when I hear filesystem corruption is BUG! That is what the experience of tracking down a number of incidents in a large data storage application a few years ago taught me.

You asked for an example, and you got it.

You're confusing your posters. freehunter asked for an example. zbentley complained that it was anecdata/specious.

I have an iPhone 6+ that's stuck in an infinite boot loop. Everyone thinks it's a BGA solder ball failure on the NAND flash part.

Can you be 100% sure it's not an APFS fuckup?

To be fair, BGA solder joint issues are ridiculously common on that model. To the point that, as someone who's serviced phones for years, I warn people away from them even if they're dirt cheap.

The rest of the post after the question mark you stopped reading at:

> Filesystem corruption is frequently silent, and every time it happens customers don't get on the phone and send the disks to Apple so that they can root cause the problem. It's quite possible this bug has happened an untold number of times before it happened to someone who went through the effort to reproduce and isolate it.

We know that because it was deployed to almost a billion devices without a problem. If there was a problem that affected even a fraction of a percent of people, it would have been all over the news given how many devices that is.

Edit: Thanks for the downvotes. If you disagree, please tell me why. Apple's deployment of APFS to iPhones was so flawless, most people probably still don't even know they did it.

Idk, most of the people in my life could probably lose data on their phone and never realize it. Especially if we’re talking about people with lots of duplicates of the same selfie, for instance.

The downvotes will be because you're adding nothing to the conversation. You're just repeating your grandparent post's point.

You’re wrong. I’m explaining why deploying to a billion devices and seeing no public outcry means there wasn’t a problem, because the parent comment is trying to claim that maybe there was a problem and people just didn’t report it. My point is that with a billion devices, even rare edge cases end up reported in the media, because rare edge cases still hit so many people that it makes them look like a widespread issue.

for a sufficiently narrow definition of "major issue" that excludes

- the encryption password hint leak ("that was Disk Utility, not APFS")

- APFS volume erasure issues ("also Disk Utility")

- Adobe, Unity (editor and games built with it), Steam, Source Engine, and FCPX crash, performance, and asset loss issues on APFS volumes, all of which went away when moved to HFS+ volumes ("those teams should have adapted their software during beta")

- performance and incompatibility issues with spinning-disk drives ("platters are bad, APFS is designed for SSDs")

- RAID kernel panics, even on supported RAID 0 configurations ("that's corecrypto, not APFS")

There were multiple unicode issues on APFS reported here on HN, but I think those were before it went live.

As far as you know.

File systems fall into the category of software where a bug can have disastrous consequences. Even if the probability of a bug is small, the magnitude of the consequence means that the overall risk is still high. And the current quality of software coming from Apple is so bad that the probability is not low.

For myself, I'm not letting APFS near my systems for at least a couple more years.

I'd point out it was deployed to iPhones - devices where the underlying physical disk won't get smaller so this bug would not have appeared.

iOS devices are extremely constrained in a number of ways that MacOS isn't - who knows how many other bugs have failed to surface because Apple thought their iOS test was a 'job done' moment.

Some of the upgrades to High Sierra in my company failed and left the laptop unbootable. Unfortunately, I wasn't involved in the repair, so I've only been told it was due to the APFS conversion and that the solution was to wipe and reinstall.

Yeah. Frankly, the APFS rollout was incredibly smooth for a new filesystem.

Not contradicting you but this makes me wonder if writable disk images are used at all behind the scenes in iOS.

It irrecoverably destroyed my partition on upgrade to High Sierra.

There was the password vs password hint thing.

Which was a bug in Disk Utility, not APFS.

Right, my bad.

ReFS (Microsoft's new filesystem) had a much smoother deployment. I haven't heard of any major issues since it came out in Windows Server 2012.

You haven't been reading then.

For people doing enterprise work and backups it's been a nightmare - here's one backup vendor that's been tracking issues with high RAM+CPU usage for almost 2 years now [1]. Early on, if data reached over 2.0 TB, it would silently corrupt on certain cluster sizes and when deduplication was enabled [2]. Per the Veeam thread, the "fix" for [1] is only preventative, meaning that currently affected volumes will need to be reformatted entirely.

This doesn't excuse the APFS goofs, but silent data corruption and grinding servers to a halt just writing data to the system are pretty major show stoppers, never mind that ReFS can't be used for a host of every-day operations (i.e., it's a storage level solution, not really an every-day-driver style File System).

[1] - https://forums.veeam.com/veeam-backup-replication-f2/refs-4k...

[2] - https://blogs.technet.microsoft.com/filecab/2017/01/30/windo...

ReFS was only ever made the default on new installs of Windows Server 2012. It never actually made it to production builds of Windows 8 or above, so it's only installed on a fraction of the systems that are out there. That's not really a sufficient sample size to say that this deployment went along much better. The APFS update was a much, much larger endeavor and, based only on public response, was nearly seamless. This is the first major issue I've heard about with regard to APFS.

ReFS was not the default FS on Windows Server 2012, for many reasons, but one being that you cannot boot from it [1].

[1]: http://www.windowsnetworking.com/articles-tutorials/windows-...

Yes, a huge deployment, but not that seamless.

1. I had a FileVault-related corruption issue; the disk was eating itself up thinking it was encrypting... I don't have the Apple discussion link at hand.

2. Time Machine hidden snapshot "disk full" issues. It's a major PITA for me that it is not possible to turn off local snapshots.

> It's a major PITA for me that it is not possible to turn off local snapshots.

Try this terminal command:

   sudo tmutil disablelocal
Edit: I'm still on Sierra. Someone else mentioned that HS removed that command, but also pointed to: https://forums.macrumors.com/threads/solution-reclaim-storag...

Good luck!

It's rather obnoxious that they prod us to upgrade every other day while all this is in the news.

I updated day one and had the usual Apple problems (Finder broke so I had to use the terminal, sometimes it wouldn't wake from sleep so I had to force a reboot, my fans would spin up for seemingly no reason). Most of the issues were fixed a week or so later.

Anyway, I recently switched to Arch on a 2018 LG Gram and I'm not really missing anything. Battery life is great (8-12 hours of Firefox) and it has a quad core x64 processor for non-browser things.

Windows Ultrabooks are worth the purchase again.

Yeah, same. I couldn't stomach the port situation on the new macbooks (and the High Sierra issues) so ended up with a 4th gen Lenovo Carbon X1. Wiped Windows 10 and installed Mint. The whole process was dead-simple and I couldn't be happier with it.

Thanks for mentioning the LG Gram. I actually didn't know LG sold laptops (they're not available in Canada) before your comment. I just ordered one online now! I insta-purchased it when I noticed the 2018 model still has USB 3 Type-A and HDMI ports. Ctrl-key in the corner instead of Fn-key and equal sized arrow keys are a bonus.

Fits my needs as a Macbook Pro replacement. Will be running linux desktop on it too. Probably elementary OS.

Or just stop using macOS. I wish I could give you my Debian Thinkpad for a week. Granted, I've put some time into tweaking it, because the stock GNOME 3 rounded matte look is shit, but I bet you'd convert. The only issue is the screen aspect ratio; Apple still wins there.

The first thing I do with a new PC laptop, after making sure Bluetooth and the other hardware all works in Windows, is to wipe it and start installing Linux. Lately I've just cloned my Gentoo install that I've maintained since 2012.

I keep around one Win10 laptop for gaming, but I prefer Linux for any real development work.

Two bugs are described in this article:

1. An APFS volume's free space doesn't reflect a smaller amount of free space on the underlying disk

2. The diskimages-helper application doesn't report errors when write requests fail to grow the disk image

These are not even complex problems of the new format. It is just that Apple forgot basic checks. It is like the root-access-with-an-empty-password incident that happened 2 months ago. Why do these serious but basic problems happen? What is going on with Apple?
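For illustration only, the failure mode in bug 2, a helper that swallows write errors from a full backing store, can be sketched in a few lines of Python. `FixedStore` and the helper functions are made-up names for this sketch, not Apple APIs:

```python
import errno

class FixedStore:
    """Stand-in for the physical disk backing a sparse image."""
    def __init__(self, capacity):
        self.capacity = capacity  # max number of blocks we can hold
        self.blocks = {}

    def write_block(self, n, data):
        if len(self.blocks) >= self.capacity and n not in self.blocks:
            raise OSError(errno.ENOSPC, "no space left on device")
        self.blocks[n] = data

def buggy_helper(store, n, data):
    # Mirrors the reported diskimages-helper behavior: the write
    # fails, but no error is surfaced to the filesystem above.
    try:
        store.write_block(n, data)
    except OSError:
        pass  # silently dropped -- the caller believes the write succeeded

def correct_helper(store, n, data):
    store.write_block(n, data)  # let ENOSPC propagate to the caller

store = FixedStore(capacity=2)
for n in range(4):
    buggy_helper(store, n, b"x")
print(sorted(store.blocks))  # [0, 1] -- blocks 2 and 3 were lost silently
```

With the error swallowed, the layer above happily reports success while data never reaches the disk, which is exactly the "basic check" that was missing.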

(1) is incorrect. Sparsebundles are capable of over-provisioning the storage below them _by design_. They always have been. They are, I believe, APFS snapshots essentially now. This behavior is consistent with most other filesystems with similar constructs.

(2) is the real issue here.

> Sparsebundles are capable of over-provisioning the storage below them _by design_.

According to TFA, HFS+ sparsebundles reflect the limitations of their underlying volume, while APFS sparsebundles do not. Seems clear to me that this is a bug.

Sparse bundle disk images [1] are a more efficient way to store disk images on an underlying file system. They are a completely separate concept from the file system in the disk image itself (besides HFS+ apparently doing some intelligent free space reporting) and are definitely not implemented as APFS snapshots.

[1] https://en.wikipedia.org/wiki/Sparse_image#Sparse_bundle_dis...

This is by far the best explanation of what’s happening at Apple and why we’re noticing these bugs and problems: https://www.imore.com/understanding-apple-scale

Not really.

If you build a new filesystem, competent software engineers will heavily test the corner cases. What happens when the fs runs out of space? What happens when the metadata store runs out of space? Etc.

The original article mentions bugs that are pretty obvious cases to test. What precisely happens when you have a sparsebundle that exceeds the storage capacity of the containing volume? A PM needs to define what should happen and an eng needs to test that it does.

It's inexcusable that things like this aren't tested and is an organizational failure. This isn't some complex interaction of earbuds with watch and a cloud system. This is a very testable filesystem.

Agreed. Many of the issues with Apple's software quality lately are fundamentals, not issues of scale. Stuff like this, the root password bug, the many issues with reliability in High Sierra, etc.

These things speak to organizational issues.

File systems are hard. They're incredibly difficult and incredibly important. It took a long time to get to ext4 and NTFS, and there's a reason HFS+ has stuck around for as long as it has, despite all of its shortcomings and limitations.

Even in the Linux world, things like btrfs, even though some distros consider it stable, are still treated with scrutiny. Back in the early 2000s, many Linux distros refused to install on XFS or JFS.

Apple's APFS rollout really does feel like it happened way too fast.

Maybe they tested everything and didn't care. That's unlikely but possible.

My guess is more that MacOS just gets the B or C team, and no-one involved was smart enough or diligent enough to think through the implications of sparsebundles that don't reserve space.

Rene Ritchie is very biased. He has to be as his livelihood depends on Apple. I take his words with many grains of salt. The only reason he's quoting Sinofsky is because it gives him a way of excusing Apple's software stumbles of late.

Also, nobody is harder on Apple than the people who know the company best. He’d lose credibility if he pretended everything was all rainbows and unicorns when they clearly are not. He clearly cites how unacceptable some of these bugs are—that’s not being biased.

Like Ritchie, I go back to the days of when the Macintosh Operating System shipped on floppies and didn’t have pre-emptive multitasking or memory protection—everything ran in the same memory space. The entire system would crash pretty regularly due to INIT (system extensions) conflicts, for example.

I can count on one hand the number of times my Mac has kernel panicked over the last few years and I regularly run beta versions of macOS.

So we now measure the reliability of a consumer-grade system on closed hardware by how few times it experiences a kernel panic?

That’s a very low bar. At least Windows has the excuse of having to work with a bazillion drivers.

I am not sure what your point is. If macOS has fewer kernel panics on semi-closed hardware (what laptop/PC hardware is more open these days?) than Windows or Linux on a bazillion drivers, then macOS would be preferable to me if the hardware us acceptable, even if supporting a bazillion drivers is a greater feat.

The point is that Apple provides both OS and drivers for a relatively small set of hardware. In these conditions, the fact that it has kernel panics at all is just Bad. Kernel panics should not happen in 2018, we have the knowledge and the technology to make it happen. It's bad for Windows and Linux as well, but they have the drivers excuse at least - MacOS has no such excuse.

> Rene Ritchie is very biased. He has to be as his livelihood depends on Apple.

No, he quoted Sinofsky because he’s one of the few people in the world who understands what it’s like trying to operate at this scale, since he was at Microsoft during its heyday.

Corner cases that affect only .01 percent of the installed base aren’t a big deal when you’re operating at a few million; it’s entirely different when it’s more than a billion devices.

The software issues haven't been corner cases lately. iOS 11 is just... bad software. High sierra has had some rather embarrassing issues too and, I'll say again, they aren't corner cases.

What difference does 1mil vs. 1bil make when they're deploying software on identical hardware? The type of security and stability bugs showing up in modern macOS and iOS are unacceptable at 100,000 installs. What does volume have to do with accepting empty root passwords?!

> Apple forgot to have basic checks

A common issue with Apple lately.

Sure, it's a new filesystem; it's bound to have bugs. However, one pain point I've identified is that the existing tools often have no idea of how to deal with APFS. I'm currently typing this from a Mac with an APFS drive that is almost certainly experiencing filesystem corruption–I have folders that suddenly lose track of their contents and become impossible to delete; however, existing tools such as fsck, diskutil, etc. can't do anything to fix the issue, because their idea of how APFS works is woefully inadequate.

Yep, I had an APFS volume that I couldn't even health check. I was pretty sure it was filesystem corruption, as it'd hang and shut off randomly then come back up without issues, same as when the SSD died the first time. Also, an APFS Time Machine volume would never finish encrypting, even after being plugged in for days.

Why would you assume that's file system corruption?

That could be hardware failure. I've had an SSD fail, back around 2013, and they often just lock up without warning. They'll work for 30 min to an hour and then just stop and freeze.

> Sure, it's a new filesystem; it's bound to have bugs.

It's literally one of the most critical components of an operating system. Bugs in the filesystem or disk utilities are not small things. They have the potential to be disastrous.

There is an fsck_apfs; I’m pretty sure that’s what the repair button in Disk Utility runs.


  # diskutil verifyVolume /dev/disk0s2
  Started file system verification on disk2s1 macOS
  Verifying file system
  Volume was successfully unmounted
  Performing fsck_apfs -n -x /dev/rdisk2s1
  Checking volume
  Checking the container superblock
  Checking the EFI jumpstart record
  Checking the space manager
  Checking the object map
  Checking the APFS volume superblock
  Checking the object map
  error: btn: invalid key (210, 16)
  Object map is invalid
  The volume /dev/rdisk2s1 could not be verified completely
  File system check exit code is 8
  Restoring the original state found as mounted
  Error: -69845: File system verify or repair failed
  Underlying error: 8: Exec format error

`fsck_apfs -n -l -x /dev/rdisk1s1`

APFS not having block or file-level checksums really seems like a big oversight to me. While the filesystem’s designers considered the hardware-level guarantees to be sufficient [1], this issue shows that there is an entire class of problems that they have not considered. Disk images and loopback-mounted filesystems or even disk-level cloning introduce additional layers of complexity where a filesystem can be silently corrupted, even when the actual physical storage layer is perfectly reliable.

A filesystem should be able to last for decades (HFS was designed thirty years ago); I regard not having checksums in a brand new filesystem an over-optimistic tradeoff.

[1] http://dtrace.org/blogs/ahl/2016/06/19/apfs-part5/
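The general shape of block-level checksumming is simple to sketch. This illustration assumes nothing about APFS or ZFS internals; CRC32 stands in here for whatever hash a real filesystem would use:

```python
import zlib

BLOCK = 4096  # typical filesystem block size

def write_with_checksums(data):
    """Split data into blocks and store a CRC32 alongside each one."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return [(b, zlib.crc32(b)) for b in blocks]

def read_verified(stored):
    """Return the data, raising if any block no longer matches its checksum."""
    out = []
    for i, (block, crc) in enumerate(stored):
        if zlib.crc32(block) != crc:
            raise IOError(f"checksum mismatch in block {i}")
        out.append(block)
    return b"".join(out)

stored = write_with_checksums(b"A" * 10000)

# Simulate bit rot in the second block: the data changes, the recorded
# checksum does not, so the next verified read catches it.
block, crc = stored[1]
stored[1] = (b"B" + block[1:], crc)
try:
    read_verified(stored)
except IOError as e:
    print(e)  # checksum mismatch in block 1
```

This is what disk images and loopback layers silently lack without filesystem support: the corrupted block reads back without any hardware-level error at all.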

Apple's HFS+ didn't have file data checksums either. The default filesystem on most Linux distributions, ext4, doesn't either; it just stores checksums of the file metadata, not the file data. Same story with Windows' NTFS. Microsoft's newer ReFS filesystem has file data checksums disabled by default. So it seems like a tradeoff that most of the major operating systems are making, most likely for performance.

Edit: macOS disk images do have a checksum of the whole image data though. The issue mentioned in the article seems to be caused by an oversight in the disk image helper app, rather than in the APFS filesystem itself.

Performance is only an issue if your disk can write faster than your CPU can hash. hammer2 changed its hash a couple of years ago because this started happening with newer NVMe drives[0], but before that disk writes weren't CPU-bound.

[0] http://lists.dragonflybsd.org/pipermail/commits/2016-June/50...
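Whether checksumming would actually bottleneck writes is easy to measure on your own machine. A rough Python sketch follows; SHA-256 is deliberately a heavyweight choice, and real filesystems tend to use much cheaper checksums (CRC32C, fletcher4), so treat the number as a lower bound:

```python
import hashlib
import time

buf = b"\x00" * (64 * 1024 * 1024)  # 64 MiB of data to hash

start = time.perf_counter()
hashlib.sha256(buf).digest()
elapsed = time.perf_counter() - start

throughput = len(buf) / elapsed / 1e6  # MB/s
print(f"SHA-256: {throughput:.0f} MB/s")
# If this number comfortably exceeds your drive's sustained write
# speed, checksumming is not the bottleneck on your hardware.
```

Results vary wildly by CPU, which is consistent with the hammer2 story above: a hash that was fine for SATA-era drives became the limiting factor once NVMe arrived.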

What about:

- Disk reads may unnecessarily trash the CPU caches, because the CPU will need to verify the checksum when the DMA read is done, even if the app isn't going to process the data immediately afterwards

- Battery life: without checksums the CPU can stay mostly idle, and go into a lower power mode, while the disk controller does its job

> APFS not having block or file-level checksums really seems like a big oversight to me.

It wasn't an oversight; it was a deliberate design decision.

Other replies have noted that maybe it was a performance issue, etc. But I think it's something much different. The real reason is that there is more downside than upside to reporting these errors.

Users would be very upset if iOS told them that there was a bad block in one of their precious selfies from last month. But they might not even notice or care about a few bad pixels in the image itself.

I'm just telling you what one big probable rationalization was for this decision. I'd personally want to know, but people on HN aren't "average" iOS users.

That's not really any better. That just supports the narrative that Apple only cares about frivolous uses of their devices ("toys") and isn't serious about supporting Pro users (who will definitely want to restore corrupt files from backups).

I tend to create sparsebundles to clone my git repos within, to get around the overhead of having huge numbers of inodes on a volume. (Copying, deleting, unpacking archives, Spotlight indexing—all are way slower when you have the worktrees and .git directories from a thousand large repos splayed out across your disk.) So I was a little worried here.

Thankfully, I had manually been setting my sparsebundles back to HFS+ on creation, because I saw no reason to make them APFS containers.

This is a neat idea, I've never thought about doing that. You should write a blog post about it.

TBH an I/O system where having a file system inside a loopback device on another file system is faster than using said file system directly in the first place sounds kinda broken-ish/poorly scaling to me.

Having Spotlight ignore .git directories and the like is probably wise, I would agree with that. But it's text, even if it's basically garbage text (from a user perspective). So I can understand how a sparsebundle is a decent end-around.

The Finder in general ends up basically being useless for me for similar reasons; I have dozens of random dependency files I don't even recognize pop up in "All My Files".

Spotlight ignores hidden directories (e.g. .git) and directories whose names end in .noindex. You can create a file in one with a unique name and try to mdfind it to verify this.

Does it ignore them, or does it traverse them and throw them out?

Anecdotally it certainly seems like indexing is slower on my dev drive than anywhere else, so I'm curious.

I'm sure it gets the events. It probably has to walk back up the tree to determine if the file is hidden. Dunno how much work it does. I presume it doesn't actually do the metadata extraction from the files. (But my presumption is based on "surely they wouldn't do that".)

The biggest offender for me when I touch a lot of files is Dropbox. It seems to use a lot of CPU when, e.g., an Xcode update is being installed. I've read that they had to listen to events for the whole volume because the more specific APIs weren't giving them the data they needed, but you'd think they could fast-path the files that were outside their sandbox.

Is your dev drive a platter drive or SSD? I've found that the last few major releases of osx have big performance issues on systems with old-school hard drives. (Frequent beach-balling, etc.)

Honestly, I don't think there's any way to get around it, if you've got an indexer daemon in the mix. It's pretty much the same as trying to store billions of rows in a RDBMS table, except that file systems+metadata indexes don't have any concept of table partitioning.

Space shared APFS volumes inside a container give you the “table partitioning” you want. You can even set them up to have different case sensitivity options. All your dev work in a case sensitive volume for instance and Adobe software on a case insensitive volume on the same space shared container.

True! My disk-image-centered workflow comes from before APFS volumes were a thing; I haven't bothered to re-evaluate it. (It is nice that I can just schlep one of these volumes around by copying one file, rather than waiting for thousands/millions of small files to compress, copying the archive, and then decompressing on the other end, though. Do you know if there's an easy method of doing a block-level export from an APFS volume to a sparsebundle disk image, or vice-versa? That'd be killer for this.)

Well, APFS is much better suited to this kind of workflow. Create a space shared logical volume inside your container and turn spotlight off on that particular volume (and if you’d like, make that volume case sensitive ). There’s no need to separate that out on a diskimage

There's still the problem of Time Machine (or any other backup software you use) needing to do a complete deep scan of the volume to ensure you haven't made any changes. If you know a git repo is effectively in a "read-only" state—something you just keep around for reference, or an old project you might "get back to" a few years from now—it can speed up those backups dramatically to put the repo into some kind of archive. Disk images, for this use-case, are just archives you can mount :)

I upgraded my hackintosh to High Sierra with APFS. The next day, I accidentally switched off the machine while it was in the _process_ of shutting down (the screen had gone blank, but the case lights were still on).

Next time I turned it on, I couldn't get past the login screen (it gave me a forever beach ball).

I put the ssd inside my old MBP as slave to recover data.

The SSD was corrupted and most data was gone: files were shown in Finder but couldn't be copied.

I googled for solutions, but it seems I'm the first to experience this.

Hypothesis: since it does _get to_ the login screen, the OS does think that the disk is in a consistent state.

Maybe your unclean shutdown forced the async conversion from HFS+ to APFS to become forced-synchronous? Try leaving the drive in the Hackintosh machine, spinning at the login screen, for a few hours. Maybe it’ll “finish.”

If it has Filevault, the "login" screen is the firmware's disk unlock screen. In my experience that will often load OK despite the filesystem being completely corrupted.

The data is probably still there, just with some metadata corrupted, making it inaccessible to standard tools.

Wow, almost sounds like the two step commit isn't working properly for APFS.

...So not an actual Apple Product.

My (limited) understanding of APFS is that it forgoes some integrity checks on the assumption that they have already been done by lower-level hardware. This is of course a debatable design decision, but it may indeed be unwise to use APFS on non-Apple hardware.

All modern hardware has this feature, and I am unaware of Apple hardware being significantly safer in this respect.

This design decision was notable because other modern filesystems (ReFS, btrfs, ZFS) do feature additional integrity checks.

I guess the question is whether you believe that Apple, which is famous for marketing and UI, is just smarter than the man-centuries Microsoft, Oracle, and Sun have poured into filesystem research, or whether this is just a bad design decision.

No, APFS must be usable on USB drives and so on. That would be a fatal design flaw.

From http://dtrace.org/blogs/ahl/2016/06/19/apfs-part5/ :

"Explicitly not checksumming user data is a little more interesting. The APFS engineers I talked to cited strong ECC protection within Apple storage devices. Both flash SSDs and magnetic media HDDs use redundant data to detect and correct errors. The engineers contend that Apple devices basically don’t return bogus data. NAND uses extra data, e.g. 128 bytes per 4KB page, so that errors can be corrected and detected. (For reference, ZFS uses a fixed size 32 byte checksum for blocks ranging from 512 bytes to megabytes. That’s small by comparison, but bear in mind that the SSD’s ECC is required for the expected analog variances within the media.) The devices have a bit error rate that’s tiny enough to expect no errors over the device’s lifetime. In addition, there are other sources of device errors where a file system’s redundant check could be invaluable. SSDs have a multitude of components, and in volume consumer products they rarely contain end-to-end ECC protection leaving the possibility of data being corrupted in transit. Further, their complex firmware can (does) contain bugs that can result in data loss."

(sorry for the edits, I finally found the paragraph my memory was referring to)

But if they're so confident in the disk, then why do they checksum the metadata? They should either trust the disk and have no checksums or not trust the disk and checksum everything.

There are plenty of other reasons not to checksum user data, as it's a choice many have made, but that they trust the disk is an invalid argument.

ZFS is the only widely deployed file system to do data checksumming by default though. You can’t blame APFS for not doing it when no other file system does it either.

I'm not sure how that could matter. It's managed writes to a hard disk. How could the brand of hard disk, or the motherboard, or whatever, matter in this situation?

For the most part the hardware in a Hackintosh isn't going to be worse than what Apple is selling. I might even say it's likely better, given that ECC memory or SAS/FC-attached disks may be in the Hackintosh (although, as others have said, the per-sector disk ECC and the ECC on the transport layers (SAS, SATA, etc.) are all much better today than they were 30 years ago). So while the rates of silent hardware-based corruption may be the same or lower, the real reason for using CRCs/hashing at the filesystem/application level is to detect software bugs.

The latter may be more prevalent on the Hackintosh simply because it is a different hardware environment. A disk controller driver variation, or even having twice as many cores as any Apple product, might be enough to trigger a latent bug.

So basically, I would be willing to bet that the vast majority of data corruption is happening due to OS bugs (not just the filesystem, but page management, etc.), with firmware bugs on SSDs a distant second. The kinds of failures that get all the press (media failures, link corruption, etc.) are rarely corrupting data, because as a device fails the first indication is simply a failure to read the data back: the ECCs cannot reconstruct the data and simply return failure codes. It's only once some enormous number of hard failures have been detected that it gets to the point where a few of them leak through as false positives (the ECC/data protection thinks the data is correct and returns an incorrect block).

The one thing that is more likely is getting the wrong sector back, but overwhelmingly the disk vendors have gotten smart about encoding the sector number alongside the data (and DIF for enterprise products), so that one of the last steps before returning it is verifying that the sector numbers actually match the requested sector. That helps avoid RAID or SSD firmware bugs that were more common a decade ago.

Are you sure the disk wasn't encrypted?

And given you are running on a Hackintosh there isn't much anyone can do given the unsupported hardware.

I get the feeling that many commenters here didn't read the article. It says:

> Note: What I describe below applies to APFS sparse disk images only — ordinary APFS volumes (e.g. your SSD startup disk) are not affected by this problem. While the underlying problem here is very serious, this is not likely to be a widespread problem, and will be most applicable to a small subset of backups.

A friend of mine lost a ton of data this week after mac os crashed and he restarted it. The only things he had done since getting the computer a week ago were:

1. Update to High Sierra
2. Copy over files from his old Mac
3. Record about 40 GB of screen-share data using QuickTime (which is what he was doing when it crashed)

He spent hours on the phone with Apple; the tech said he had never seen anything like it and they weren't able to recover his data... but after reading the other horror stories in this thread, there seem to be some serious problems with High Sierra and/or APFS.

I had a kernel panic a few weeks ago that left the OS in a state where it was unable to boot. It seemed to think it was mid-upgrade and was complaining about missing the packages an upgrade would read from. Thankfully macOS has an in-place OS reinstall option and as far as I’ve been able to see all my data is totally fine. But it was bizarre and I’ve never experienced anything like it in a decade of using Macs.

Everything running fine now after the OS reinstall?

That sounds like a failing or faulty SSD. I had this problem on HFS+ on my fairly new 2011 MacBook Air. Downloaded a ton of data, system became unstable and wouldn’t boot after.

Lost a ton of data? Doesn't he use Time Machine?

Yes, although he was traveling so he wasn't able to recover until he returned home and even then, there was a gap in his backup (his fault) and his screenshare recording wasn't backed up since it was done while traveling. Thankfully he had already sent his presentation ahead to the conference he was presenting at, so he was at least able to complete that obligation.

This is a good lesson for everyone... obviously we all know to backup regularly (let Time Machine do its thing multiple times per day of course); but the lesson for me was when traveling, have everything you need backed up on a USB stick at least.

Yeah, traveling would get you :( poor guy. Backblaze or something may have helped slightly with that (not the screenshare recording, if it was too big).

And yes, a high capacity USB stick. There are very small thumb drives that you can permanently leave in the USB-port of the laptop. Unfortunately I haven't found one like that with a USB-C connector.

My APFS volume got corrupted, probably during the upgrade to High Sierra. No data lost, but my disk is missing almost half its space. See https://apple.stackexchange.com/q/311843/26185

I “lost” a lot of space due to time machine local backups. It was frustrating to research and I thought there was something seriously wrong with my computer. Try deleting the local backups and see if you get some space back

I had to make room on my MBP SSD to install windows through bootcamp and bashed my head against a wall on the same issue. It took me half an hour to find the reason why so much of my hard drive wasn't available despite me having deleted almost all third party apps and personal data on my macOS partition. Time Machine does 'local backups' and there is NOWHERE in the user interface that fully explains the space they occupy and how to get rid of it. To delete those local backups you need to use the terminal program tmutil.
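For anyone hitting the same wall: the tmutil incantations look roughly like this. This is a sketch for macOS only (the snapshot date stamp is a placeholder, not a real snapshot); the guard just makes it a harmless no-op elsewhere.

```shell
# Inspect and remove Time Machine local snapshots via tmutil (macOS only).
if command -v tmutil >/dev/null 2>&1; then
    # List the local snapshots pinned to the boot volume; names are
    # date stamps like 2018-02-20-120000.
    tmutil listlocalsnapshots /

    # Delete one snapshot by its date stamp (substitute a real one
    # from the list above before running):
    # tmutil deletelocalsnapshots 2018-02-20-120000
fi
echo "tmutil sketch done"
```

Note that on High Sierra you can only delete the snapshots; the old `tmutil disablelocal` switch is gone, as a sibling comment points out.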

That gave me even more vindication for my move. Also, you really don't know how fast your hardware is until you've used something other than macOS on it. From booting the system to launching software.. everything is snappier now.

Yes it’s a really poor design decision. They could have at least added “local time machine backups” color to Disk Utility, or added a way to turn them off.

Agreed re: snappiness of other OS. Ubuntu flies on my 2013 MBP.

There used to be a way to turn it off before High Sierra; now the only thing you can do is delete the local snapshots. [0]

[0] https://forums.macrumors.com/threads/solution-reclaim-storag...

Unfortunately it's not the local backups. I guess the missing disk space is related to the error I get when I check the partition. I guess I need to reinstall... :(

Should have paid Oracle and just put ZFS on OS X.

Apple killed ZFS on macOS nearly 10 years ago. It wouldn’t make sense on a watch or a phone anyway: http://www.zdnet.com/article/mac-zfs-is-dead-rip/

200B in the bank and they couldn’t find a way to port it to other platforms? I think they could.

It’s not about not being able to port it; it’s not the right tool for the job. ZFS is a file system designed for servers; the Apple Watch, Apple TV and the iPhone and iPad aren’t servers.

That doesn't mean it can't be adapted and tuned for mobile. The average smartwatch has more processing power than the average server did at the time the ZFS project was launched (2001).

+1 to that. Or at least some subset that speaks the ZFS-protocol if such a thing exists.

High Sierra seems like a real gem.

Honestly, it's the first Mac OS release I'd actively recommend avoiding upgrading to. It offers essentially no benefits over the prior release, and a whole lot of downsides. (Security issues notwithstanding, High Sierra also drops compatibility for a lot of older software.)

Lion actively screwed up normal people's workflows with the botched "Save As" replacement. I still want an explanation of why they thought that was a good idea. We skipped that one after one of the admin assistants discovered the new joy.

The whole California series of OS releases has had a broken Finder. I see some fixes in High Sierra, but it's still buggy as heck for large file moves and broken scripting. I'm hoping they take a long hard look at Mac OS like they seem to be with the next iOS. I can forgive removing some UNIX commands, but the general bugs and unexplained crashes are starting to get on my nerves.

I know I sound like an old crank, but I've been a Mac user since 1984. And the last MacOS version that I completely trusted as being loyal to my needs and workflow was Snow Leopard. Every later version has felt like it was really Apple's OS and I was just borrowing it.

Totally agree. I used to be excited about upgrading MacOS; now I absolutely dread it. I only upgrade when Xcode etc doesn't work on old OS anymore.

Until 10.13.3 I could barely use my MBP; horrendous graphical corruption issues. How this can happen I have no idea.

It’s certainly no coincidence that the last version of OS X that you could buy with money (be a customer) was Lion, which marked the beginning of a decline.

I would be very happy to see macOS drop back to a biennial or looser upgrade schedule. It’s a mature platform that runs on hardware that improves significantly more slowly than the hardware iOS runs on; at this point insisting on a new release every year rather than when it’s ready is actively detrimental. As much as I still get excited by new features and tech, I use my Mac exclusively for work, and rock-solid reliability is more important to me than new features.

As did other versions of OSX[0]. This whole meme should get a law like Betteridge’s. Complaints about the current release of macOS are as old as the hills[1] and have become entirely vacuous. That’s not to say the complaints are without merit; just that there is too much hyperbole around them.

[0] https://forums.macrumors.com/threads/apps-that-wont-work-on-...
[1] https://youtu.be/GWxC8ezE4Dk

As well as every version of every OS ever. Windows 98 was the best, Windows XP sucked with its Fisher-Price UI. Then Vista/7 came around, with the outcry of "if you want the latest DirectX, you have to upgrade". And then Windows XP was discontinued to much wailing and gnashing of teeth, because XP was the best. Then 8/10 came out and everything was terrible and we're just going to stick to Windows 7 because it's the best.

To many techies, version n-1 is the best thing ever created, until version n+1 comes out and the "best ever" shifts up one revision.

Cherry-picked anecdotes are worthless.

Few if any people glorified Panther when Tiger came out, or Leopard when Snow Leopard came out, or Win3.11 when Win95 came out, or Win95 when Win98 came out, or WinME when WinXP came out, or Vista when Win7 came out.

Stop dismissing legitimate complaints just because you worship "new and shiny"

>Stop dismissing legitimate complaints just because you worship "new and shiny"

Yeah, right after you stop dismissing important security updates as "new and shiny".

No, he’s right. Whilst there was a bit of criticism of Windows XP, most people were fine with it and it was widely praised. Same with Windows 98. In fact, Windows ME, Windows Vista and Windows 8 were widely criticised when they came out for good reasons, and Windows 7 and Windows 10 were widely praised.

> Whilst there was a bit of criticism of Windows XP, most people were fine with it and it was widely praised.

It didn't start out that way. It was heavily criticized in the first instance. Certainly until SP1 was released.

Oh I fully understand that. However, in the past, I would argue those versions actually added useful things, so there was a bit of a tradeoff to consider.

High Sierra has essentially no enhancements whatsoever.

YMMV. I haven't seen any issues, and I jumped right to APFS (yes, I backup religiously). FWIW, High Sierra has been out for months and has seen three public and many developer releases since its dot-0.

I postponed upgrading because of reported issues with the original Magic Trackpad, and it looks like I dodged a real bullet right there.

Every few months, just as soon as I start thinking maybe I should give in to Apple’s incessant nagging and dark-pattern prompts to upgrade, stuff like this comes up.

I think I’ll just wait for the new release, or for a time when I’m ready to wipe any machine I want to upgrade.

Yet another reason to stick with El Capitan.

I really want to, but I'm locked out of Xcode 9 if I linger. If only I could run it through some sort of virtualization.

VMware Fusion supports High Sierra as a guest, including APFS. I was under the impression that Parallels does the same. Not sure about VirtualBox though.

The same behavior (the sparse bundle disk image not updating the free space amount in accordance with the underlying disk’s free space) is present if one selects ExFAT as format in Disk Utility.

I haven’t tested if file corruption is the consequence, too, of copying more data into the disk image than the underlying disk has free space.

> To prevent errors when a filesystem inside of a sparse image has more free space than the volume holding the sparse image, HFS+ volumes inside sparse images will report an amount of free space slightly less than the amount of free space on the volume on which image resides.

Can anyone explain the "slightly less than" part of this? Why wouldn't it just be "equal to"?

The sentences following that statement in the hdiutil manual are also helpful:

"The image filesystem currently only behaves this way as a result of a direct attach action and will not behave this way if, for example, the filesystem is unmounted and remounted. Moving the image file to a different volume with sufficient free space will allow the image's filesystem to grow to its full size."

hdiutil has some of the best man pages I've ever run across.
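As a quick way to observe the free-space clamping the man page describes, one could try something like the following. This is a sketch for macOS only; the size, volume name, and paths are illustrative, and the guard keeps it a no-op elsewhere.

```shell
# Create a sparse image far larger than the host volume's free space,
# attach it, and compare what df reports for each (macOS only).
if command -v hdiutil >/dev/null 2>&1; then
    # A 100 GB sparse image; the backing file starts out tiny.
    hdiutil create -size 100g -type SPARSE -fs HFS+ \
        -volname SparseTest /tmp/sparsetest
    hdiutil attach /tmp/sparsetest.sparseimage

    df -h /                      # free space on the host volume
    df -h /Volumes/SparseTest    # per the man page, reported free space
                                 # should be clamped slightly below the
                                 # host volume's, not the full 100 GB

    hdiutil detach /Volumes/SparseTest
    rm -f /tmp/sparsetest.sparseimage
fi
echo "hdiutil sketch done"
```

Per the manual text quoted above, the clamping only applies on direct attach; unmounting and remounting, or moving the image to a roomier volume, lets the filesystem grow to its full size.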

To account for filesystem metadata?

Is the incident with the matching checksum that he mentions because APFS only checksums metadata (which is on preallocated space in the image), or is it his own checksum (say, sha1sum), I wonder? It seems strange that the filesystem driver would cache 500 GB of sequentially written data in RAM.

> It seems strange that the filesystem driver would cache 500 GB of sequentially written data in RAM.

That was the most interesting/worrying part of TFA, and I would love to see how the checksum tests were conducted clarified in the text.

Presumably, the "md5" command-line tool has no special fallback to a filesystem checksum cache (if it does, rather a lot of my life has been a lie, I'm afraid). Since that's the case, could we assume that, if the "lost" writes totalled $X GB of data, any evil memory-caching of the file will only work in the presence of at least $X GB of free system memory (RAM plus swap)?

I'd also be interested in learning what happens if there's less than that amount of memory available. Will the checksum fail? Will an error occur elsewhere? Will the system have some sort of memory (and swap) exhaustion failure/panic?

The video embedded in TFA shows md5 reporting identical checksums before unmounting the disk image, so it must be reading the data from a cache.
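For anyone wanting to reproduce the methodology, the test in the video boils down to something like the following sketch. This is macOS only, and the image path and file name are placeholders for illustration, not the author's exact commands; the guard makes it a no-op elsewhere.

```shell
# Compare checksums before and after unmounting, to rule out
# reading the write back from the page cache (macOS only).
if command -v hdiutil >/dev/null 2>&1; then
    hdiutil attach /tmp/backup.sparseimage -mountpoint /tmp/img

    cp bigfile /tmp/img/
    md5 bigfile /tmp/img/bigfile   # may match even if the write was
                                   # lost: md5 can be served entirely
                                   # from cached pages in RAM

    hdiutil detach /tmp/img        # unmount, dropping the cached copy
    hdiutil attach /tmp/backup.sparseimage -mountpoint /tmp/img
    md5 bigfile /tmp/img/bigfile   # the honest comparison is here,
                                   # reading back from the image itself
fi
echo "md5 sketch done"
```

The identical checksums before unmounting are consistent with the first `md5` being satisfied from cache; only the post-remount read exercises what actually landed on disk.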

Can you even test a filesystem properly internally?

Seems to me we're all involved in a massive public beta.

Externally "testing" a filesystem isn't exactly easy either. If an error occurs, there's not gonna be an exception thrown that you can send off to telemetry.

If you're lucky, your customer notices that files are missing, understands that it has to be a bug in the operating system and maybe even has a rough idea what they were doing that caused the bug to occur, then calls up support and your first-level support is competent enough to direct the problem to the filesystem people and then it's still going to require a lot of luck for that department to reproduce the problem and to actually find out what in the code is causing it.

Apple authored a tool called fsx (file system exerciser), but I doubt it checks for corner cases for loopback mounts. Maybe they should add such tests.

That tool has been around for ages, I think it might originally be from SGI.

They say it takes ten years to get the bugs out of a filesystem. Unfortunately I think a general release is necessary to mature a new FS.

So why exactly does somebody back up into an image of variable size, when they could just dd together a fixed-size image that makes these free-space calculations unnecessary?

I use a sparse bundle disk image, i.e. a disk image with variable size and consisting of multiple files under the hood, because it is more efficient to back up over a network. Instead of uploading a 50 GB file to a cloud storage on every backup, only a fraction of data has to be uploaded (the sparse bundle’s changed files) which makes backing up the 50 GB file very fast if only a few megabytes were added.

If anyone is curious: I use restic[1] as backup client and Backblaze B2[2] as backup storage. Works well with sparse bundles.

[1] https://restic.net

[2] https://www.backblaze.com/b2/cloud-storage.html
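For reference, creating such a sparse bundle is a one-liner, and the band files under the hood are what make incremental uploads cheap. A sketch, macOS only, with illustrative names and the usual guard:

```shell
# Create a sparse bundle and peek at its band files (macOS only).
if command -v hdiutil >/dev/null 2>&1; then
    hdiutil create -type SPARSEBUNDLE -size 50g -fs HFS+ \
        -volname Backup /tmp/backup

    # A sparse bundle is a directory; data lives in many small "band"
    # files (8 MB each by default), so a file-level backup tool only
    # re-uploads the bands that changed since last time.
    ls /tmp/backup.sparsebundle/bands | head
fi
echo "sparsebundle sketch done"
```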

Wait, why would the 50 GB need to be transferred fully every time if it was a fixed size image?

The very first backup of a fixed size image will be the full size of the image, e.g. 50 GB, no matter which backup software you use. Even if inside the image no files exist. The very first backup of a sparse bundle disk image will be ~100 MB (the initial size of that image).

On repeated backups, some backup software operates on the file level and uploads the whole file if it changed. So if you have a fixed-size 50 GB image, mount it, add a file, and unmount it, it has changed, and the whole 50 GB image file has to be uploaded (with some backup software).

Sure, but any halfway competent backup program would split it into chunks before deduplication anyway, rendering the exercise pointless.

I doubt I'd hit the bug described, but when I do a "clean install" I often create an encrypted sparse disk image onto a network volume and a copy on the Desktop of my new install. Unrelated, I do my bootable backup as a "smart backup" (only copy changes) to a usb-disk. I can see myself doing that to a sparse image on the network.

As I set up my new computer I'll move stuff or delete things out of the sparse disk image from my Desktop, then periodically reclaim space.

I only do this every 5+ years or so. I haven't done this with APFS, yet either. But those are a few of my use cases that dd wouldn't cover.

Variable size (sparse) disk images make it possible to backup e.g. two 1TB volumes that are on average <50% full to a single 1TB disk. Very useful to have for networked backups (Time Machine) of multiple machines.

Not to downplay the importance of this, but it reads as clickbait that you wait until the second paragraph to say "oh yeah, it's only sparsebundles. Just those things that almost nobody uses."

Yes, nobody - like for example Apple, to implement Time Machine.

But now perhaps I better understand why Time Machine backups aren't supported on APFS.

We use encrypted sparsebundle images to hold sensitive data: SSH keys, SSL keys, license files. I doubt I'm the only one; this is pretty important to our business.

Eh, it's a minority but not an "almost nobody" minority. There are plenty of pieces of software even besides CCC that rely on/give prominent options for the use of sparsebundles.

It's more like "the latest OS update leaves you vulnerable to malicious javascript everywhere, but only if you use Opera". It's a minority of users, sure, but it's still really important information to note.

Click bait.

> What I describe below applies to APFS sparse disk images only — ordinary APFS volumes (e.g. your SSD startup disk) are not affected by this problem. While the underlying problem here is very serious, this is not likely to be a widespread problem, and will be most applicable to a small subset of backups. Disk images are not used for most backup task activity, they are generally only applicable when making backups to network volumes. If you make backups to network volumes, read on to learn more.

The title clearly qualifies the claim: "MacOS may lose data _on APFS-formatted disk images_"

I didn't know there were APFS-formatted disk images (new in 10.13). Even when you consider the many different kinds of disk images that macOS supports, there's a pretty clear distinction between disk image and a backup of your startup disk, made to another partition in another drive.

Any additional clarification would get into "MacOS may lose data on APFS-formatted disk images (disk images, not disk-to-disk, as in another volume..." territory.

Yeah this is the opposite of a click-bait title; it clearly explains what the article is about.

"may" lose data on "APFS-formatted" disk images.

A lot of people do make backups to disk images though, especially techies. I believe that remains the only method to use Time Machine with a generic network-connected device rather than an Apple-branded Time Capsule. The last place you want an unreliable file system is your backups!

Quite a few non-Apple network devices support Time Machine.

I back up every day to my Synology NAS for example.

Unrelated (sorry for being opportunistic):

How do you like your Synology NAS? I’m considering it.

Not the person you replied to, but I want to make a case for ZFS on a generic small motherboard. You wind up with Linux or FreeBSD so it's a general-purpose server, unless you want to use something like FreeNAS. And with ZFS, you get snapshotting, RAIDz, checksumming, etc, as opposed to "oh it has RAID 5 woohoo".

RAID (including the software RAID in Linux) doesn't actually do checksumming for file verification. AFAIK, ZFS is the only open system to do so.
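For what it's worth, on a ZFS system both things are directly visible. A sketch assuming a pool named `tank` (a conventional placeholder, not a real pool); the guard keeps it a no-op where ZFS isn't installed:

```shell
# Inspect ZFS data checksumming and scrub results (systems with ZFS only).
if command -v zpool >/dev/null 2>&1; then
    # Checksumming of user data is on by default (fletcher4 unless changed).
    zfs get checksum tank

    # Kick off a scrub, which re-reads and verifies every block:
    # zpool scrub tank

    # The CKSUM column here counts checksum failures per device.
    zpool status tank
fi
echo "zfs sketch done"
```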

Thank you

You appear to have an agenda of protecting Apple.

The title is very clear, and the first paragraph which you quoted explains it in detail.

It's significant that an established respected company--the makers of Carbon Copy Cloner--will not support APFS formatted disk images for its backups.

While your agenda may be driven by the need to protect Apple, the rest of us need to know this important news about APFS so we can be fully informed.

Carbon Copy Cloner is an excellent program btw - very happy owner here

Completely agree. I gave up on Time Machine a long time ago for a variety of reasons, but CCC continues to meet my backup needs perfectly.

This isn't politics.

People can have opinions without them needing to be flagged as having hidden agendas. And I have worked at Apple and can assure you that there isn't some payment structure for posting on forums.
