

Ask HN: Do image hosting sites remove redundant images? - mgallivan

I've started reading http://www.aosabook.org/en/distsys.html and the author's first example is an image hosting website. My mind went off on a bit of a tangent and I got curious.

Do image hosting sites keep a hash of their images and store only one copy of each? I'm not sure how much effort it would take to ensure there is a single copy of each image, but it seems like it would save some space.

Does anyone know?
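
To make the question concrete, here is a minimal sketch of what "keep a hash and store one copy" could look like. It assumes images are plain byte strings saved under the SHA-256 of their content; `DedupImageStore` and its methods are hypothetical names for illustration, not any real host's design.

```python
import hashlib
import os


class DedupImageStore:
    """Hypothetical store: one blob per unique content hash, many names per blob."""

    def __init__(self, root):
        self.root = root
        self.names = {}  # image name -> content hash (stands in for a database table)
        os.makedirs(root, exist_ok=True)

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(self.root, digest)
        if not os.path.exists(path):  # write the bytes only for unseen content
            with open(path, "wb") as f:
                f.write(data)
        self.names[name] = digest  # every upload still gets its own name entry
        return digest

    def get(self, name):
        with open(os.path.join(self.root, self.names[name]), "rb") as f:
            return f.read()

    def blob_count(self):
        return len(os.listdir(self.root))
```

Note that this only catches byte-identical uploads. The same picture re-encoded or resized hashes differently and would be stored again.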
======
relix
I don't think so. The average filesize of a JPEG image on the web is what,
50KB? At those sizes, it's just not worth it to put a system in place that
could introduce more bugs through extra complexity. Especially since duplicate
images would be relatively rare, as in 0.00001% rare. Completely not worth it.
You'd still need extra database entries too, so you're not even saving on the
overhead, database size or queries.

Look at it this way: in the time spent coding this feature, I'm pretty sure
disk capacity would grow by more than the extra space the duplicates need
(relatively speaking, over the long term, of course).

------
a_bonobo
The creator of Imgur recently did an AMA where someone asked this exact
question:

>do you hash and store only one copy of duplicate images?

>Believe it or not, we don't. All the images only use up about 3TB of storage
space, so it's not really a big issue.

Source:
[http://www.reddit.com/r/IAmA/comments/y81ju/i_created_imgur_...](http://www.reddit.com/r/IAmA/comments/y81ju/i_created_imgur_ama/?utm_source=dlvr.it&utm_medium=feed)

On the other hand, YouTube stores 76 PB:
[http://www.afshispeaks.com/2012/08/youtube-storage-costs-per...](http://www.afshispeaks.com/2012/08/youtube-storage-costs-per-year/)

------
UnoriginalGuy
Maybe they do and don't even know it. Some database engines will compare
blobs, and then delete duplicates.

~~~
true_religion
No serious image host is going to stick images in the database.

~~~
macowar
Well, technically a file system is a specialized type of database. NTFS has
some support for removing duplicate files and replacing them with links.
Windows Server 2008 calls this feature "Single Instance Storage".
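
For illustration, the "replace duplicates with links" idea can be sketched at the application level using ordinary hard links. This is not how Single Instance Storage is implemented; `link_duplicates` is a hypothetical helper, and it assumes the files are immutable, small enough to read into memory, and all on one volume.

```python
import hashlib
import os


def link_duplicates(directory):
    """Replace byte-identical files in `directory` with hard links to one copy.

    Returns the number of files replaced. Hypothetical helper, not an NTFS API.
    """
    first_seen = {}  # content hash -> path of the first file with that content
    replaced = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        original = first_seen.setdefault(digest, path)
        if original == path or os.path.samefile(original, path):
            continue  # first copy, or already linked to it
        os.remove(path)
        os.link(original, path)  # same name, now sharing the original's data
        replaced += 1
    return replaced
```

One caveat: with hard links, a write through either name changes both, so this only makes sense for files that are never modified after upload.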

