
How could Google use hashes to avoid duplication? They'd have to download each link before they could hash the contents thereof, so the damage would still be done.

The damage could be capped at 3 downloads per Google document: if the first 3 downloads produce the same hash, start a rate limiter, throw up a captcha, or delay further fetches to avoid heavy intra-document duplication.
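The cap-then-throttle idea above could be sketched as follows. This is a minimal illustration, not Google's implementation; the threshold and class names are assumptions:

```python
import hashlib

DUPLICATE_THRESHOLD = 3  # illustrative value from the comment above


class DuplicateLimiter:
    """Per-document limiter: after N identical downloads, start throttling."""

    def __init__(self, threshold: int = DUPLICATE_THRESHOLD):
        self.threshold = threshold
        self.seen: dict[str, int] = {}

    def record(self, body: bytes) -> bool:
        """Record one download; return True once further fetches should be throttled."""
        # Hash the downloaded bytes so identical responses can be detected
        h = hashlib.sha256(body).hexdigest()
        self.seen[h] = self.seen.get(h, 0) + 1
        return self.seen[h] >= self.threshold
```

The damage is still done for the first few downloads, as the parent comment notes, but it is bounded per document rather than unbounded.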

> How could Google use hashes to avoid duplication?

Rate limit per website (e.g. don't download more than 10 images per domain per second)

Limit the total number of images it downloads per document, so a single user cannot cause too much traffic.

In that case, users may notice a performance decrease in spreadsheets for images from certain websites.


(I know that servers can be configured not to send ETags or break caches by sending random ones every time, but this could reduce the data usage considerably since most of the responses would only include the headers.)

The query parameters make each request different. ETags are not unique across the internet, only for a specific URL, so an ETag only helps if the same request is made again later. And even a conditional request that comes back 304 still returns a full set of headers, which, while far smaller than 10MB, adds up to a lot of traffic.

But they could hash the filename (a hash prevents accidental disclosure of content).

Hashing the filename doesn't help: each URL is different, which is exactly why caching doesn't work.

Setting aside that ETags are tied to URLs rather than files, the ETag approach userbinator suggests might work in some cases. But if the large file is dynamically generated, it is unlikely to have an ETag at all. And many servers default to deriving the ETag from the file's inode rather than from any property of its contents, so multiple servers behind a load balancer are likely to return different ETags for the same file.
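The inode problem can be shown in a few lines. Older Apache defaults derived the ETag from inode, size, and mtime (`FileETag INode MTime Size`); the exact format below is illustrative, not Apache's byte-for-byte output:

```python
def apache_style_etag(inode: int, size: int, mtime_ns: int) -> str:
    # Sketch of an inode-based ETag in the spirit of Apache's old default.
    return '"%x-%x-%x"' % (inode, size, mtime_ns)


# The same 10MB file replicated to two backends lands on different inodes,
# so each backend emits a different ETag and conditional requests always miss.
etag_a = apache_style_etag(inode=4242, size=10_485_760, mtime_ns=1_700_000_000)
etag_b = apache_style_etag(inode=9001, size=10_485_760, mtime_ns=1_700_000_000)
```

This is why later Apache versions dropped the inode component from the default: it defeats caching behind any load balancer.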

