

Hacker News "Submit" functionality needs a Digg-like duplicate alert - jharper

Do you guys agree?  I hate posting dupes ... but this site doesn't exactly discourage it.
======
brk
Usually when I've submitted a dupe story, it just adds a vote to the
pre-existing story.

Where the logic either breaks (or allows people to subvert it) is where
different URLs can get you to the same story. Oftentimes URLs contain
superfluous flags that don't change the content served, but just serve to log
some referrer or layout-type data (I'm sure everyone reading this site gets
how this works). Adding, removing, or changing any of this data seems to
pretty much break or confuse whatever dupe-detector logic exists.
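For what it's worth, here's a rough sketch of the kind of URL normalization that would catch most of those (the list of tracking parameters is made up for illustration, not whatever the site actually strips):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of parameters that usually carry only
# referrer/layout data, not content (purely a guess).
TRACKING_PARAMS = {"ref", "src", "source", "partner", "em", "pagewanted"}

def normalize_url(url):
    """Canonicalize a URL before dupe-checking it."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Keep only parameters that plausibly change the content served.
    kept = [(k, v) for k, v in parse_qsl(query)
            if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((scheme, netloc, path.rstrip("/"), urlencode(kept), ""))
```

Two submissions that differ only in www vs. no-www, trailing slash, or a tracking flag would then normalize to the same string and hit the existing exact-match check.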

~~~
jharper
Human dupe-detection would be an excellent extension to this process.

~~~
Xichekolas
Why not just compare the _content_ at the other end of the link with the
content of existing links?

It wouldn't be that hard. Whenever a link is submitted, YC's server would
visit the link, get the response, strip all html tags and white space from it,
then hash whatever is left. It would then store this hash value with the link.
Whenever a new story is submitted, it is likewise hashed and then a check is
made for an existing link with the same hash value. If it exists, it's a
dupe; if not, allow it.

This would be an extra check on top of the existing dupe-URL string check, of
course. It still wouldn't catch every single thing, but it should eliminate
quite a few easy dupes.

If that turns out to have a low success rate, try hashing the page title or
maybe the HTTP headers.
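The whole pipeline is only a few lines. A sketch (the fetch and the crude regex tag-stripping are stand-ins for whatever the server would really do):

```python
import hashlib
import re
from urllib.request import urlopen  # stand-in fetch, not the site's actual code

def content_fingerprint(html):
    """Strip tags and whitespace, then hash whatever is left."""
    text = re.sub(r"<[^>]*>", "", html)  # crude tag removal
    text = re.sub(r"\s+", "", text)      # drop all whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_dupe(url, seen_hashes):
    """True if the page at url hashes to something we've already stored."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    return content_fingerprint(html) in seen_hashes
```

Pages that differ only in markup or whitespace then collapse to the same fingerprint.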

~~~
joshwa
A single comment or timestamp would change the hash.

Maybe the <title>, or the contents of the first <h1> or something would be a
better proxy.
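Grabbing the <title> for that check is trivial with a stock HTML parser; something like this (a sketch, not anything the site runs):

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Pull the <title> text out of a page as a dupe-detection proxy."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def page_title(html):
    parser = TitleGrabber()
    parser.feed(html)
    return parser.title.strip()
```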

~~~
Xichekolas
Yeah that is what I was thinking when I added that last line.

For some reason I initially wasn't thinking about comments... so the title
would be a much better proxy.

------
pg
It catches most dupes, just not on sites like the NYT that have so many
different urls.

~~~
far33d
Maybe you could check if the title of the page is the same as well, not as an
automatic detection, but instead as a reason to ask "are you sure this isn't
the same as foo". This might prevent most of the NYTimes dupes.

~~~
derefr
Would it be that hard to just build a fulltext index of each page that hits
the hot page? From there, just flag anything with >N% similarity (probably 98
or so, since text ads can change the page a little bit).
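One cheap way to get that similarity score is Jaccard similarity over word shingles (this is a generic near-duplicate technique, not a claim about how the index would actually work; k=5 is an arbitrary choice):

```python
def shingles(text, k=5):
    """Set of overlapping k-word windows from the page text."""
    words = text.split()
    return {" ".join(words[i:i + k])
            for i in range(max(1, len(words) - k + 1))}

def similarity(a, b):
    """Jaccard similarity of the two pages' shingle sets, in [0, 1]."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Two copies of the same article with slightly different ad text would score just under 1.0, so a threshold like 0.98 would still catch them.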

~~~
ph0rque
With your comment, I saw for the first time how the semantic web might be
useful.

------
brlewis
Sounds like a nice-to-have feature. I don't think it's that bad to delete your
own post if you notice the dupe soon enough. No skin off your nose if the
moderators kill it, either. That's happened to me.

------
jakewolf
Hah, I beat you by 1/2 an hour <http://news.ycombinator.com/item?id=144730>

~~~
manvsmachine
If I notice it early enough, I usually do what you just did and post the link
to the original submission as a comment.

------
chaostheory
The only thing I noticed before (not sure if it's still valid) is that
www.website.com is a different submission from website.com.

~~~
jey
Those aren't the same URLs; the dupe detector currently only rejects exactly
duplicated URLs.

------
tim2
The way they have it executed is incredibly annoying.

