Ask HN: Is there a data loss bug lurking in MS365 backup solutions?
75 points by ryan87 on Sept 6, 2023 | 47 comments
It sounds crazy, and maybe I'm just doing something dumb, but I've seen a similar issue in two different MS365 backup products this year. I can't reproduce it reliably, but feel like there could be a serious issue, even though I can't prove it.

My issue is specific to OneDrive. When I reconcile a backup set against live data, files are missing. I've had it happen with both Veeam Backup for MS365 and Synology Active Backup for MS365. Neither system reports any issues when backups run. I don't know the cause and I can't reproduce it, but it happens consistently for at least one of my tenants and seems to get worse over time. I've seen the issue on more than one tenant, so I don't think it's anything tenant specific and the only (untenable) solution I've come up with is to restart the backup from scratch.

The tenant that has the most issues is a business with about 200k files. The business owner owns the files and everyone else has access to a few shared folders near the top of the hierarchy. They have about 350GB of data which ends up being about 1TB of quota after versioning.

I originally ran into the issue with Veeam by randomly spot checking 1-2 files every once in a while and running into missing data by random chance. That made me realize I needed to do some kind of bulk reconciliation on a regular basis. I gave up on Veeam because they append the OneDrive file version to all restored files and it makes it difficult to reconcile. For example, locally restored files end up with ' (ver 2)' or similar appended to the file name.

I switched to the Synology system because it's ideal to reconcile since an up-to-date backup set can be shared via SMB. That makes it possible to have an up-to-date OneDrive sync and a mapped drive to an up-to-date backup set on the same machine. After that, it's a matter of comparing two folders as long as care is taken to get a consistent point-in-time for both sets of data.
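For anyone who wants to try the same kind of folder comparison, here is a minimal sketch. It assumes both trees are fully available as local paths (for example the OneDrive sync folder and the SMB-mapped backup set) and that nothing changes while it runs. The function names are made up for illustration, and comparing sizes is only a cheap first pass, not proof that the contents match.

```python
import os


def list_files(root):
    """Map each file's path (relative to root) to its size in bytes."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = os.path.getsize(full)
    return files


def reconcile(live_root, backup_root):
    """Return (missing_from_backup, extra_in_backup, size_mismatch)."""
    live = list_files(live_root)
    backup = list_files(backup_root)
    missing = sorted(set(live) - set(backup))
    extra = sorted(set(backup) - set(live))
    # Files present on both sides whose sizes disagree.
    mismatched = sorted(p for p in live.keys() & backup.keys()
                        if live[p] != backup[p])
    return missing, extra, mismatched
```

Calling `reconcile(live_root, backup_root)` with the two folder paths returns three sorted lists of relative paths; the first one is the list of files that exist live but not in the backup.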

The only noteworthy thing that I think plays a part is how frequently the tenant reorganizes their data. They're always renaming and moving files and folders to keep things organized. I'd frame it as being frequent, but not unreasonable. The reason I think this is noteworthy is that in cases where I'm able to track the life-cycle of missing files, they seem to "disappear" after being impacted by a directory rename or move operation.

I can't engage with support for this particular tenant because the data is supporting documentation for government work. I can't even share examples or screenshots with the file names AFAIK.

I've seen people complaining about similar issues, but the complaints are from years ago [1]. This [2] caught my eye.

> The root cause for the missing data was due to incorrect representation of the changes from the SharePoint API side. Veeam RnD team performed an investigation and found that sometimes the SharePoint API mechanism of tracking changes did not track the changes inside the Child files or Folders inside the Site`s list.

Does anyone here reconcile their OneDrive backups well enough to say you're confident you can reliably restore your data with 100% consistency?

Is it possible there's a change tracking bug lurking in the SharePoint API? I don't know anything about how it works, so any insight would be useful. For example, would the OneDrive client and backup clients use the same change tracking? That seems unlikely based on what I'm seeing, but, again, I have no idea how it actually works.

1. https://forums.veeam.com/veeam-backup-for-microsoft-365-f47/...

2. https://forums.veeam.com/veeam-backup-for-microsoft-365-f47/...




Possibly related: A few weeks ago I was told to use the autosave option of MS word. The login dialog went through, but then nothing happened. Classic Microsoft. I reinstalled OneDrive, autosave said it was on, and I went ahead with my work. That was Friday 3PM. Come Monday, I found that the last 2 hours of Friday were gone and spent about half an hour looking for them. I Ctrl+S about every 5 seconds so I couldn't imagine that there was no copy of it anywhere. But there wasn't one, it was just gone.

What happened was that OneDrive was stuck trying to upload my VMs (I had those in my _local_ documents folder, which OneDrive claimed ownership of during reinstallation without telling me). OneDrive never showed my Excel file in the list of files it was uploading, so it could have either tried to do the VMs first and never got to the Excel file or quietly errored out on that without ever uploading it.

In any case, I'm still dealing with the fallout of the fatal mistake I made: Trusting a microsoft product to do the absolute fucking basics.

Maybe I'll learn this time.


I'm glad I'm no longer on the side of managing Microsoft 365 stuff for a company. I've seen this happen too many times, and of course the user takes it out on you for it. Was so much simpler and reliable in the past just having a NAS and mapping it to a drive. Even my current company still uses NAS for mission critical stuff.


You know what local NAS storage isn’t good for? Microsoft’s bottom line.


I'm in the position of that user. I have a boss telling me to not turn Team Microsoft's problems into mine. She's happier if I take someone from that team along my troubleshooting journey. It now takes two people's time to arrive at the same conclusion. And most of the time that conclusion is that microsoft screwed up and the only thing we can do is reinstall.


> Possibly related: A few weeks ago I was told to use the autosave option of MS word.

I don't think this would be related.

I have no reason to suspect any data loss from OneDrive. It's the backups that are failing for me. If I had to restore from backup right now, it wouldn't match what's stored in OneDrive.


OneDrive uses SharePoint as a backend, so maybe there's a bug in there that causes both issues.


Autosave was buggy from the beginning. It basically has no protection if the operation is interrupted for some reason.


SharePoint is a garbage machine.

If you value local documents.


I wouldn't trust any MS cloud product to not lose my data.

A year or two ago I lost data backing up my OneDrive files to my local HDD. I was told my backup had succeeded, and so I deleted the files on OneDrive. Later, when I tried to extract some large files from the backup, I saw that they had been replaced with some text files, containing a message that the download of those files failed.

What the heck... Was I expected to extract the archive and go through all the files one-by-one (there are thousands of them) to check if every file was properly backed up?


> What the heck... Was I expected to extract the archive and go through all the files one by one (there are thousands of them) to check if every file was properly backed up?

There are a lot of silent pitfalls like that in my experience. For example, if you use folder level encryption with Synology Active Backup for MS365 it can silently mangle the file names due to path length restrictions. You'll end up with files that have "file name too long" as part of the file name.

That's why I'm trying to reconcile every file in this data set. I don't trust anything without being able to personally verify it at this point.


> For example, if you use folder level encryption with Synology Active Backup for MS365 it can silently mangle the file names due to path length restrictions.

Is this Microsoft's fault or Synology's fault? Much of this discussion sounds like an inability to distinguish between first and third party products.


> Is this Microsoft's fault or Synology's fault?

It's 100% on Synology.

> Much of this discussion sounds like an inability to distinguish between first and third party products.

Yeah. Did I explain it badly or something? I thought I made it clear that I'm having an issue with 3rd party backup solutions and that it's possible they're using some shared API from Microsoft, but most people seem to be focused on OneDrive specific issues.


> Is this Microsoft's fault or Synology's fault?

If it were Microsoft's fault the filenames would have been filena~1, filena~2 etc. /s


>I wouldn't trust any MS cloud product to not lose my data.

I raised issues of data loss and management prioritized shipping to customers instead. I can't say I blame them, they can take those risks knowing customers will accept them because their big name is behind the product. That said, I heard the issues were resolved so proceed with caution

Note: this was not onedrive


> later, when I tried to extract some large files from the backup, I saw that they had been replaced with some text files, containing a message that the download of those files failed.

That's a really asinine failure mechanism. At the very least, it should create "filename-FAILED_TO_RESTORE" or something like that.

Something that I was instructed to do nearly 30 years ago when at Motorola was: never overwrite the original files. Standard procedure was to make a directory called "BACKUP" (or "RESTORE"?) and put the files in there instead. The requesting person was then responsible for moving the files back to their original location, so any screwups were on them, not us (IT).


I think the only correct way to handle failures is to have a single file called ONE_DRIVE_RESTORE_FAILURES.txt at the restore destination root, and in it list the path to every file that failed to download, one path per line. Much harder to miss than individual files with conspicuous names.
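A rough sketch of what that could look like, assuming the restore tool already collects the paths that failed to download. The function name and the decision to write nothing when there are no failures are my own assumptions, not how any existing product behaves.

```python
import os

REPORT_NAME = "ONE_DRIVE_RESTORE_FAILURES.txt"


def write_failure_report(restore_root, failed_paths):
    """Write one failed path per line at the restore root.

    Returns the report path, or None when nothing failed (no file is written).
    """
    failed = sorted(failed_paths)
    if not failed:
        return None
    report = os.path.join(restore_root, REPORT_NAME)
    with open(report, "w", encoding="utf-8") as fh:
        fh.write("\n".join(failed) + "\n")
    return report
```

The absence of the file then means "everything restored", and its presence is a single, obvious place to look.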


The dogfooding was a thing 30 years ago.


Sun did it much more recently than that.


This is exactly what I've encountered with all the cloud-based file storage apps now that they've gone to "dynamic download".

You will get failed downloads quietly.


Does someone know if rclone also can fail silently when syncing onedrive? [0]

From my experience it doesn't but onedrive provides, sometimes, filesizes that are wrong, so checking file integrity is a tiny bit more of a pain.

[0] https://rclone.org/docs/


Not Microsoft, but related:

I am convinced that there's a major issue with iCloud Drive and iCloud Photos losing files and photos. I've got more than one folder on iCloud Drive I know had files in them; but nothing is there. I've got old photo albums, from say 2012 or so, that have 6 pictures in them; but the identical album in Google Photos has 9 (I always manually replicate uploads, albums, etc to Google. A backup, I guess).

The one good thing I'll say about Google is; I've got Hangouts chat logs from like freakin 2007 or something in my history. In fact, their data retention may be too good; about six months ago I logged into Google Drive (Workspace) to find a bunch of folders I know I had deleted, years ago, and I mean years, just chilling in the root folder. Stuff from 2010? 2011? It was actually a trip down memory lane, and I'm not un-glad to have recovered some of the stuff for my archives, though it's unclear why it happened.

Dropbox as well. I don't use it regularly, but I signed into my Dropbox account from the 00s one day the other day, and everything was just there. For 14-15 years, never paid them a dime.

The fascination with Microsoft in the corporate world is a collective psychosis that we'll probably never recover from. Everything they make is shit. It's insecure, unreliable, doesn't do half of what it advertises, and does the other half so much worse than the worst competition that their products' adoption in organizations could by any reasonable logic be correctly viewed as intentional sabotage.


I feel like Howard Oakley (https://eclecticlight.co/) has written about issues with iCloud products but I’m sleepy and don’t feel like digging through search results.


Here’s another one. One day I started an old device which apparently loaded OneDrive and my entire OneDrive (150GB) got wiped.

It still showed links under recent files but the actual files were gone.

Customer support said they can’t do anything about it. 6 years of photos, personal files, university work and co gone.

I happened to have an old notebook with _some_ files on it.

You _cannot_ trust 365/OneDrive.


> You _cannot_ trust 365/OneDrive.

Cloud is _not_ a backup.


> They're always renaming and moving files and folders to keep things organized.

Ah, there we go! It's their own fault for assuming a filesystem can be used for basic file and folder changes. /s

On a serious note, it turns out that replicated filesystems as seen in OneDrive or DropBox are hard to engineer. Astonishingly difficult to get right. "Formal proof verifiers" level of difficult, otherwise customers will lose data in everyday circumstances. It's happened to me, it happened to all of my coworkers, and it has happened to your customers.

Data loss with a poorly engineered replication system isn't some rare corner-case, it's the typical case, and it is a miracle if data isn't lost!

A few years ago DropBox rewrote their sync engine and wrote a great blog article about all the complex scenarios they had to handle: https://dropbox.tech/infrastructure/rewriting-the-heart-of-o...

The OneDrive team has done none of this hard work, I guarantee it.

OneDrive loses data if you look at it wrong. If you call Microsoft Support they'll tell you not to look at it like that. Or they tell you to go complain on some support forum, as-if that's the industry standard for achieving engineering rigour.

I do all of my work on local C:\ paths and copy files to SharePoint or OneDrive only if forced to collaborate with colleagues. This rare interaction with OneDrive causes data loss about 5% of the time, often enough that I notice.


> I do all of my work on local C:\ paths and copy files to SharePoint or OneDrive only if forced to collaborate with colleagues.

This is an option I considered, but it's harder for non-power users to deal with the file management. It's very likely some users would copy files instead of moving them and it would quickly devolve into trampling each other's changes.

That Dropbox article was great, so thanks. If I had to summarize my issue now, I'd say it feels like the backup solutions have a poorly implemented sync engine. I'm guessing they simply get a stream of changes from SharePoint though and I wonder if it would even be plausible to think they could misapply those changes.


A backup solution makes it into an even more complex sync configuration:

    User <=> OneDrive => Backup

Multi-way synchronisation like that is full of landmines. I read a great article once about how a similar many-way, multi-master sync needs to be implemented. The context was LDAP directory synchronisation, but the concepts are similar.


> I can't engage with support for this particular tenant because the data is supporting documentation for government work. I can't even share examples or screenshots with the file names AFAIK.

Your purchase department should have access to their Microsoft-side account representative to get you access to support without breaching legal requirements.


I've been in a few companies where we had dedicated account managers on the MS side. One was even so big that there were MS engineers sitting with our teams.

Questions of actual substance were rarely answered satisfactorily, sometimes weeks of chasing got nowhere.

With this context I would not hold my breath for them to come back with anything useful.

Also, having seen the kinds of untested garbage they deploy as services, I would not be surprised at all at them silently losing data. I'm not even sure they would ever notice that they have lost data.

For anyone using their storage as backups, I recommend having an inventory (with hashes) and comparing that to what's in MS from time to time. At least you know you are not going crazy, even if MS never trusts your proof about their incompetence.
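A minimal sketch of such an inventory, assuming the data can be read from a local path (a synced or restored copy). The function names and the choice of SHA-256 are illustrative assumptions; the returned inventory is a plain dict, so it can be stored however you like and compared against a later run.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root):
    """Map each relative path (with '/' separators) to its SHA-256 hex digest."""
    inventory = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            inventory[rel] = sha256_of(full)
    return inventory


def diff_inventories(old, new):
    """Report paths that disappeared, changed content, or were added."""
    return {
        "missing": sorted(set(old) - set(new)),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
        "added": sorted(set(new) - set(old)),
    }
```

Note that a renamed or moved file shows up as one "missing" path plus one "added" path with the same hash, so heavy reorganisation needs a second pass that matches entries by digest.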


Hell, github has had a lot of outages in the last weeks (webhooks for the most part), I would not be surprised if they'd lose github data .. which I guess would be detected by git eventually but hey, what's gone is gone.


Side rant: OneDrive is so much worse than DropBox. But if you’re on MS and use Word or Excel with other people there is no way around OD.

Same with Teams. It’s easily the worst piece of SW on my computer, closely followed by OneDrive.


I can highly recommend uninstalling teams and using the web version instead.

It's a bit less broken and crashes much less often, but the most important thing is that you can just F5 it when it does, which is much faster than starting up the desktop app.


Interesting. Which browser do you recommend?


I used firefox without any issues, but today was the first time I tried to make a call instead of joining a meeting. The site said that my browser didn't support that (meetings work perfectly, so that's obviously BS) and that I should use edge or chrome.

So edge and chrome are the only options. Still better than having audio stop working in almost every call and constant crashes.


Are you sure you understand how VEEAM MS365 actually backs up data?

They designed it to look like a differential backup model, but it's not. In fact the only backup you can trust as to its integrity is the first full one. https://helpcenter.veeam.com/docs/vbo365/guide/retention_pol...

“When a retention policy is applied in backup repositories with the Snapshot-Based Retention type, Veeam Backup for Microsoft 365 removes versions of an item, but not an item itself. Data removal from backup occurs every time the restore point of an item's version in a backup file goes beyond the retention coverage. Eventually, if no more changes were made to an item, Veeam Backup for Microsoft 365 will remove all versions of an item except the latest one. The latest versions and items that were never changed stay in a backup repository with the Snapshot-Based Retention type forever.”


> They designed it to look like a differential backup model

> Eventually, if no more changes were made to an item, Veeam Backup for Microsoft 365 will remove all versions of an item except the latest one.

I added the emphasis. I read that to mean I can expect the current point-in-time (aka right now) to be identical to my live data. I'm only trying to reconcile the most recent set of data, so that retention policy shouldn't make any difference, right?


> Veeam

As my 2 year old would say, "uh oh."

Are you saying your backup solution is:

- there's a folder somewhere whose contents are "synchronized", crying laugh emojis galore, using a "Dogshit" protocol, as it appears in the programs Microsoft OneShit or DropShit or Shit.net or Google Shit.

- you have a machine that has a "copy" of a dogshit-synchronized file system, perhaps using a vendor's implementation of Dogshit, such as Synology Dogshit Manager, a piece of software with which I am intimately acquainted

- that machine "regularly" "backs up" its "copy" via a procedure known confusingly as "dogShit"

- wait a minute, the files aren't right!

I hear you. This is very surprising, but it would occur no matter which file sharing system you use, so long as it's based on Dogshit and/or dogShit.

What is the specific flaw with dogShit? The simplest explanation for everything you observe, like the constant moving by multiple users, is that while using ordinary file system enumerations of the form "walk," a folder may be moved in a way such that it is never visited by "walk," even though the destination of the moved folder is still within the root directory of the walk. You can easily verify this in Python.
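Here is a minimal sketch of that claim using a plain local `os.walk`. It only demonstrates the filesystem-enumeration effect. Nothing here shows that Veeam or Synology enumerate OneDrive this way, and the folder names and helper functions are made up for the demo. The walk lists `a` before it visits `z`, so when `z/sub` is moved into `a` after `a` has already been listed, the file inside is never seen even though it never left the root.

```python
import os
import shutil
import tempfile


def walk_files(root, move_during_walk):
    """Walk root top-down; optionally move z/sub into a/ while visiting a/."""
    seen = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit "a" before "z" so the demo is deterministic
        for name in filenames:
            seen.append(os.path.relpath(os.path.join(dirpath, name), root))
        if move_during_walk and dirpath == os.path.join(root, "a"):
            # Simulates a user reorganising folders while the walk is running.
            shutil.move(os.path.join(root, "z", "sub"),
                        os.path.join(root, "a", "sub"))
    return seen


def demo(move_during_walk):
    """Return (files seen by the walk, whether the file still exists on disk)."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "a"))
        os.makedirs(os.path.join(root, "z", "sub"))
        with open(os.path.join(root, "z", "sub", "report.txt"), "w") as fh:
            fh.write("data")
        seen = walk_files(root, move_during_walk)
        on_disk = (os.path.exists(os.path.join(root, "a", "sub", "report.txt"))
                   or os.path.exists(os.path.join(root, "z", "sub", "report.txt")))
        return seen, on_disk
```

With `demo(False)` the walk sees `z/sub/report.txt`. With `demo(True)` the walk sees no files at all, while the file is still on disk under `a/sub`.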

Okay, so now that I solved your bug in dogShit, what should you use instead? Snapshotters have already solved this problem, you have to create a dedicated filesystem for the shared directory on your Synology, then snapshot the whole filesystem, which will only have issues with "missing" since the moment it started but never traversing the filesystem incorrectly. Then you can backup the file system snapshots to S3. In Synology you can achieve this with btrfs and some elbow grease.

But the right answer is to not use dogshit. At the end of the day the people who are authoring Dogshit file sharing products, they should be giving you a backup approach, not demanding you do it ad-hoc.


> They designed it to look like a differential backup model, but it's not.

The quote you pasted makes it sound like a reverse-differential incremental backup, like rdiff-backup implements (poorly). The most recent backup is a snapshot of the tree, restore points having changes are diffs applied cumulatively in reverse chronological order atop the current snapshot. You can simply toss out restore points starting from the tail older than the retention policy.

How is that not a differential backup model?


In case anyone comes back to this, it's because any snapshot whose retention date is out of the retention period can get removed. When the original item is out of the retention period it can also get removed.

There is no cumulation of diffs, it's just diffs from the original object.

You might notice that the quoted text of my original post has also been changed on the link.


I can't edit anymore, but want to clarify. Files are missing from my backups, not from OneDrive. The backup software fails to reproduce the data in OneDrive.


Are those files marked as 'always keep offline'? If not, the backup process might not trigger a download.


You're probably thinking along the lines of pointing a file level backup solution at the OneDrive folder, right? This isn't like that. They're commercial solutions that are configured as an Azure App, so they always work against the online data.

I know what you're saying though. If you took something like Arq Backup and pointed it at your OneDrive folder, I don't know how it would work, but suspect it would fail with files-on-demand because it takes a VSS snapshot to get a stable point-in-time. I've never tested, but assume working off a snapshot doesn't trigger the download for files-on-demand and it would feel broken to anyone that doesn't realize what's going on.

That's a good observation, but unlikely to be related to what I'm seeing here.


That's terrifying. I'm going to spot check our N-able backups more closely


My OneDrive (for Mac) woes are many:

- When saving a file, it's sometimes not possible to access the file immediately after saving. This is super fun when you're writing software and you save and then can't compile it.

- The setting to keep every file locally doesn't stop files being ejected and then not kept locally.

- Files that are not local take very long to download (e.g. an 85kb file just now took ~60s on a 1GB fiber connection)

- One drive creates duplicate folders and files during syncing errors

- Some bug in making changes to a word doc and then renaming it made the new changes inaccessible by either the old or new name. This was an old issue, but was particularly stressful when I was working on a ~50 page document with a large amount of edits across the entire doc, that had a hard deadline for submission.

- For a period of time I was unable to login with my account to OneDrive using the Mac software. Support could not give me any indication about why the error was occurring and were unwilling to continue diagnosis - the suggestion was to effectively wipe my user profile and start over

- Eventually in some update the sync client started working, and promptly deleted every file in my OneDrive folder before pulling it down again (100s of GB). A stressful few days of wondering which files would be kept / lost at the end

- Sync speed is reduced to 10s of kbits/s when you have millions of files (on a 1Gbit/s connection this is frustrating)

- Putting any git repo or a nodejs project in OneDrive is problematic due to the millions of files issues, generally results in locking up the process

- No tooling to understand how much syncing time is left and see the overall status of large syncs

- No user accessible user interface / logs to help understand why things are syncing / what is happening under the hood (this wouldn't really be necessary if the rest of the software wasn't so buggy).

There are many more examples I could go into of little things OneDrive does that are buggy, broken, or terrible. But my general impression is that the quality of the product (at least on the Mac side of things) is not sufficient to be fit for purpose. OneDrive really has 1 main job: make the data I save available when I need it. It fails at that regularly.

In summary, I wouldn't trust OneDrive to not lose data (but haven't yet moved off it due to inertia and the effort required). I would recommend that other users choose anything other than OneDrive if you care about any of your data long term.


The underlying issue is going to be super benign. Like Veeam probably does a `walk`, which is a snapshot of a directory structure in time, and then a user moves a directory, which doesn't cause Veeam to invalidate its entire `walk` thus far, so that directory is never visited.

https://news.ycombinator.com/item?id=37411554


This is a pretty low effort comment by me, but what is a backup solution you can’t verify really worth? GNU tar has --compare. In some ways we are inundated by amazing advances in tech, in other ways… not so much.


Just one?



