
Organizing Data Through the Lens of Deduplication - anishathalye
https://www.anishathalye.com/2020/08/03/periscope/
======
anishathalye
Hi HN! Recently, I decided to take care of a task I had been procrastinating
for a while: to organize and de-dupe data on our home file server. I was
thinking of it as a mundane task that needed to get done at some point, but
the problem turned out to be a bit more interesting than I initially thought.

There are tons of programs out there designed to find dupes, but most just
spit out a huge list of duplicates and don't help with the work that comes
after that. This was a problem (we had ~500k dupes), so I wrote a small
program to help me. The approach, at a high level, is to provide
duplicate-aware analogs of coreutils, so e.g. `psc ls` highlights duplicates
and `psc rm` deletes files only if they have duplicates elsewhere.

I thought it was a somewhat interesting problem and solution, so I wrote a
little write-up of the experience. I'm curious to hear if any of you have
faced similar problems, and how exactly you approached organizing/de-duping
data.

