

How to prevent duplicate submissions-- a standard way to specify URL equivalency? - joshwa

Nytimes URLs got me again this morning -- I didn't mean to submit a dupe, and HN's dupe catcher didn't catch it.

What if robots.txt had a way to specify which query parameters in a URL actually identify the content? e.g.

  UniqueContentURLParameters: "articleID","node"

would tell dupe checkers to ignore any query parameters that aren't in the list when comparing URLs. For example:

  http://news.ycombinator.com/robots.txt:
  UniqueContentURLParameters: "id"

  http://nytimes.com/robots.txt:
  UniqueContentURLParameters: none
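
Roughly, the dupe checker could then do something like this (just a sketch -- the robots.txt parsing is skipped, and the NYT URLs and the canonicalize helper are made up for illustration; the idea is simply to drop every query parameter the site didn't list):

  from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

  def canonicalize(url, unique_params):
      """Reduce a URL to a canonical form for dupe-checking.

      unique_params: the query parameters the site declared as
      content-identifying (via the proposed UniqueContentURLParameters
      line). Everything else is dropped, and what's kept is sorted so
      parameter order doesn't matter.
      """
      scheme, netloc, path, query, _fragment = urlsplit(url)
      kept = sorted((k, v) for k, v in parse_qsl(query) if k in unique_params)
      # Lower-case the host and drop a leading "www." so www/non-www
      # variants compare equal too.
      netloc = netloc.lower()
      if netloc.startswith("www."):
          netloc = netloc[4:]
      return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

  # nytimes.com declares no unique parameters, so these two (made-up)
  # URLs canonicalize to the same thing and the second is a dupe:
  a = canonicalize("http://www.nytimes.com/2008/05/09/world/story.html?_r=1&ref=rss", set())
  b = canonicalize("http://nytimes.com/2008/05/09/world/story.html?partner=rss", set())
  assert a == b
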
Of course, this doesn't solve the problem of bad URL design, such as the BBC news site, where you have

  http://news.bbc.co.uk/2/hi/asia-pacific/7391008.stm

and

  http://newsvote.bbc.co.uk/mpapps/pagetools/print/news.bbc.co.uk/2/hi/asia-pacific/7391008.stm

I guess you could specify a list of regexes?

As an alternative to robots.txt, I could see a shared database where linksharing/social news sites accumulate this information.

How would you solve this problem?
======
raghus
wrt robots.txt, what is the incentive for site owners to prevent duplicate
submissions? As a site owner, wouldn't I be happy if both
<http://www.myadsenseblog.com> and <http://myadsenseblog.com> got submitted?
2x15 minutes of fame!

~~~
joshwa
* makes it easier to track whether your content has gone viral

* votes aren't split across multiple copies -- a few votes here, a few there, when together they might have made the front page

* users are less pissed off (looking at you, nytimes and bbc)

------
bkovitz
Multiple regexes sounds fine. When you discover a new kind of duplicate, add a
new regex.
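
Something like this, say (just a sketch -- the rule list and names are made up; the first rule handles the BBC print-version URL from the post):

  import re

  # One (pattern, replacement) rule per known kind of duplicate; when a
  # new kind shows up, append a new rule.
  NORMALIZATION_RULES = [
      # Collapse the newsvote.bbc.co.uk print mirror back to the
      # canonical news.bbc.co.uk URL.
      (re.compile(r"^http://newsvote\.bbc\.co\.uk/mpapps/pagetools/print/"),
       "http://"),
      # Treat www.example.com and example.com as the same host.
      (re.compile(r"^(https?://)www\."), r"\1"),
  ]

  def normalize(url):
      for pattern, replacement in NORMALIZATION_RULES:
          url = pattern.sub(replacement, url)
      return url

  assert (normalize("http://newsvote.bbc.co.uk/mpapps/pagetools/print/"
                    "news.bbc.co.uk/2/hi/asia-pacific/7391008.stm")
          == "http://news.bbc.co.uk/2/hi/asia-pacific/7391008.stm")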

