
Very useful for identifying files that may need to be deduplicated or that can be removed entirely. Unfortunately, I don't think this will also find identical directories.

If deleting files isn't what you want, I'd suggest looking into deduplicating tools.

ZFS has its own deduplicator built in, which is nice. It should just deduplicate files and individual extents of files by itself once you enable it. Probably not a good idea on very write-heavy disks, but it's an option.

Other file systems with extent-level deduplication can use https://github.com/markfasheh/duperemove to deduplicate not only whole files but also individual extents. This can be very useful for file systems that store a lot of duplicate content, like different WINE prefixes. For filesystems without extent deduplication, duperemove should try hard linking files to make them take up practically no disk space.




Yes

Hardlinking files would be a dangerous idea, in my opinion

What is good about the FIDEDUPERANGE ioctl is that everything is transparent from the userspace point of view: whatever the files are, whatever they are used for, nothing has changed

When you hardlink, you see that, at the current moment, the files are the same, but then you assume that userspace wants those files to stay the same

To me, this sounds like a recipe for disaster

This ioctl lives at the VFS layer. As of now, btrfs and xfs have implemented it. There is a merge request lurking around for ZFS, and I have yet to check what bcachefs' status is on this. The only top player left is ext4 :/

(disclosure: I work on duperemove)
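
For the curious, here is a rough sketch of how the ioctl gets called (not duperemove's actual code; the helper name and the single-destination setup are just for illustration). You point it at a source range and one or more destination ranges, and the kernel only shares extents after verifying the bytes really are identical:

    /*
     * Minimal sketch: ask the kernel to share the first `len` bytes of
     * src_fd into dst_fd.  The kernel compares the two ranges itself and
     * only links extents if they match, so userspace never sees a change.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int dedupe_range(int src_fd, int dst_fd, __u64 len)
    {
        struct file_dedupe_range *arg =
            calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
        if (!arg)
            return -1;

        arg->src_offset = 0;
        arg->src_length = len;
        arg->dest_count = 1;               /* one destination range here */
        arg->info[0].dest_fd = dst_fd;
        arg->info[0].dest_offset = 0;

        int ret = ioctl(src_fd, FIDEDUPERANGE, arg);
        if (ret == 0 && arg->info[0].status == FILE_DEDUPE_RANGE_SAME)
            printf("deduped %llu bytes\n",
                   (unsigned long long)arg->info[0].bytes_deduped);

        free(arg);
        return ret;
    }

duperemove batches many destination ranges into one call via dest_count; this shows the simplest single-destination case.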


> Unfortunately, I don't think this will also find identical directories.

Generate a hash over the list of hashes for a directory's content. That would allow you to detect identical directories. That directory hash is rather volatile and would need regeneration every so often, but that shouldn't be a major problem.
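
A toy sketch of that idea (all names are made up, and FNV-1a just stands in for whatever hash you already compute per file): sort the children's content hashes so the result doesn't depend on directory order, then hash the concatenation.

    #include <stdint.h>
    #include <stdlib.h>

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    /* child_hashes: one content hash per file in the directory */
    uint64_t dir_hash(uint64_t *child_hashes, size_t n)
    {
        /* sort so the fingerprint doesn't depend on readdir() order */
        qsort(child_hashes, n, sizeof(uint64_t), cmp_u64);

        uint64_t h = 14695981039346656037ULL;       /* FNV-1a offset basis */
        const unsigned char *p = (const unsigned char *)child_hashes;
        for (size_t i = 0; i < n * sizeof(uint64_t); i++) {
            h ^= p[i];
            h *= 1099511628211ULL;                   /* FNV prime */
        }
        return h;
    }

Two directories with the same dir_hash are candidates for being identical, and a subdirectory can feed its own dir_hash in alongside the file hashes, so the scheme recurses.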


> ZFS has its own deduplicator built in, which is nice. It should just deduplicate files and individual extents of files by itself once you enable it. Probably not a good idea on very write-heavy disks, but it's an option.

It's also a memory hog, and I feel like there were other caveats but I don't remember for sure


I sent a pull request to ZFS that adds support for FIDEDUPERANGE, which makes all these tools work without having to turn on the large-scale online deduplication. Instead it uses block cloning. Test it if you are willing!


You can now dedicate an SSD or something fast to holding the deduplication tables so it's not a memory hog anymore


Huge memory impact that never goes away: if you turn off dedup, you still need the massive deduplication map in RAM for things that were already deduplicated.


Actually, you can avoid that if you do offline deduplication, because memory can be replaced by storage


Interesting! I did not know that.



