

Ask HN: Better duplicate identification - RiderOfGiraffes

What if each submission caused the HN engine to download the title from the referenced page? It's exceedingly unlikely that there'll be two different useful pages with the same HTML title, but whenever there's a duplicate submission, the title should be the same.

It's previously been suggested that duplicates could/should be detected by looking at the page source, but surely it would be enough 99.9% of the time just to check for duplication of the <title> contents.

Thoughts? Problems? Counter-suggestions?
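Sketched in Python purely for illustration (HN itself is written in Arc; the libraries and the normalization choices here are assumptions, not a claim about how HN would do it):

    # Sketch only: fetch a page's <title> and check it against titles
    # already seen. Library choice (requests + BeautifulSoup) is
    # illustrative.
    import requests
    from bs4 import BeautifulSoup

    seen_titles = {}  # normalized title -> first submission URL

    def title_of(url):
        resp = requests.get(url, timeout=5)
        tag = BeautifulSoup(resp.text, "html.parser").find("title")
        return tag.get_text(strip=True) if tag else None

    def is_duplicate(url):
        title = title_of(url)
        if not title:
            return False  # no <title>: fall back to plain URL comparison
        key = " ".join(title.lower().split())  # normalize case/whitespace
        if key in seen_titles:
            return True
        seen_titles[key] = url
        return False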
======
pierrefar
The NYT, Huff Post, and others run LOTS of experiments on titles, measuring
click-through rates. It is very likely that a submitted page will have a
different title later.

~~~
RiderOfGiraffes
I didn't know that. I would hope that by the time the title gets changed the
potential contributor would have seen the earlier contribution.

Mind you, the evidence is against me there ...

Still, perhaps _most_ sites have an unchanging title, so perhaps it will still
work well enough.

------
joshfinnie
I think if you make it the submitters' job to submit the lowest common
denominator URL, we would rid ourselves of the duplications. I have to
believe everyone on Hacker News knows how URLs work and can see that something
like this:

    ?utm_source=feedburner&utm_medium=twitter&utm_campaign=Feed%3A+startup%2Flessons%2Flearned+%28Lessons+Learned%29
is not needed as part of the submission URL.

And to cover all bases: if you don't know whether it is needed, try it before
you submit it to HN. Delete the above portion of the URL, hit enter, and if you
still get the same webpage you have done your due diligence. Finding a correct
algorithm for this might be tough, but if we put the pressure on the
submitters themselves we might get something done.
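A sketch of that normalization in Python; the utm_ blacklist is an assumption, and a real deployment would need a broader list:

    # Sketch: drop known tracking parameters before comparing URLs.
    # The blacklist of prefixes is a guess; extend as needed.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PREFIXES = ("utm_",)

    def normalize_url(url):
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if not k.lower().startswith(TRACKING_PREFIXES)]
        # Rebuild without tracking params and without the #fragment.
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(kept), ""))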

~~~
RiderOfGiraffes
Well, case in point:

<http://news.ycombinator.com/item?id=1012670>

<http://news.ycombinator.com/item?id=1012341>

The linked items have the same title, but the URLs are different because of
extra, unnecessary parameters. The title check would have prevented the
duplication.

But if you put the onus on the submitter, it won't change. After all, two of
the recent duplications were duplicates of items on the front page, not even
obscure ones.

There's no penalty for people who don't bother, so automated checking goes a
long way. The only way you'll get people to check before they submit is if
there's a penalty - perhaps duplicated items should cause a loss of karma. But
that's pretty draconian when we _want_ good items; people might stop bothering.

Checking the title seems easy to implement and gains a lot; then we can see
whether the problem remains.

And I think this _is_ a problem, because the discussion can get diluted across
multiple submissions, and the duplicated discussion is a waste of time and
effort. As a hacker, I resent that.

------
sp332
Maybe borrow a feature from Stack Overflow, which presents similar-sounding
stories before posting? Some client-side JS which pulls the page,
heuristically picks some "interesting" words, and runs them through search.yc?
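Server-side, the word-picking heuristic might be as simple as this sketch (the stopword list and the "rarest N words" rule are assumptions); the resulting words could then be fed to search.yc as a query:

    # Sketch: pick distinctive search terms by taking the least common
    # non-stopword tokens from the page text. Thresholds are guesses.
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def interesting_words(text, n=5):
        words = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in STOPWORDS and len(w) > 3]
        ranked = Counter(words).most_common()
        return [w for w, _ in ranked[-n:]]  # rarest words are most distinctive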

------
brk
Makes sense to me. It also seems that a good part of the dupes come from the
fact that the full URL is used as the identifier, and oftentimes that URL has
a lot of referrer garbage and similar data in it. There could be some logic to
pare URLs down somewhat in the comparator code.

~~~
RiderOfGiraffes
For some pages the "stuff" that follows the obvious root URL is in fact
required. There are no easy rules for paring down the URL - that would need to
be done by hand, and most people who submit items don't take the time to do
that.

~~~
brk
Sometimes it _IS_ certainly required, but many times there are extraneous
switches tacked on, following standard URL query-string patterns.

Take this URL for example:

<http://www.wired.com/thisdayintech/2009/12/1223Shockley-Bardeen-Brattain-transistor?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+wired%2Findex+%28Wired%3A+Index+3+%28Top+Stories+2%29%29&utm_content=Google+Feedfetcher>

The question mark is the demarcation point between the story URL and the
"garbage". While some cases may exist, I don't recall recently seeing a URL of
the format:
<http://mydomain.com/foo/storyFetcherScript?storyID=TheOneAboutTheDonkey>

Even so, one possibility would be to compare URLs up to the first question
mark, or other obvious demarcation characters, and if a dupe is found, ask the
submitter to verify that their submission is unique. Like most other things at
HN, I think that for the time being an honor-system yes/no question is all that
would really be needed to weed out 95% of the dupe submissions.
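In code, that compare-up-to-the-question-mark check is tiny; a sketch (treating "#" the same way is my own addition):

    # Sketch: treat everything after the first "?" or "#" as noise for
    # dup detection; on a hit, ask the submitter to confirm uniqueness.
    def stem(url):
        for mark in ("?", "#"):
            url = url.split(mark, 1)[0]
        return url.rstrip("/")

    def probable_dupe(url, existing_urls):
        s = stem(url)
        return any(stem(e) == s for e in existing_urls)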

------
riffer
It should also help supplement the spam filtering. Taking a quick look at
/noobstories, a bunch of those would die from this check alone. That should in
turn make the /newest page somewhat better, which can only be a good thing.

------
Tichy
Maybe a good way to identify duplicates would be a comparison of word
frequencies (or frequencies of three-word phrases, or something like that)?
Perhaps that way one could eliminate the "layout" and identify the core text.

I would like to create a library for that, if I could find the time :-/ Maybe
then sites like HN would have an easier time including the functionality.

I'm not sure counting frequencies would be enough to identify the duplicates
that have been filtered through translation. Interesting stuff.
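A sketch of the three-word-phrase idea, using shingles and Jaccard similarity (the 0.8 threshold is an arbitrary assumption):

    # Sketch: compare pages by their sets of 3-word shingles.
    def shingles(text, k=3):
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def similarity(text_a, text_b):
        a, b = shingles(text_a), shingles(text_b)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)  # Jaccard index

    def looks_duplicate(text_a, text_b, threshold=0.8):
        return similarity(text_a, text_b) >= threshold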

~~~
scott_s
Natural Language Toolkit for Python: <http://www.nltk.org/Home>

It wouldn't be hard to use that to come up with a quick "fingerprint" of a
page, consisting of word frequencies. The chances of two different pages
having the same fingerprint would be exceedingly small.
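For instance, as a sketch (assumes nltk is installed with its "punkt" tokenizer data; hashing the top terms is just one fingerprinting choice among many):

    # Sketch: a word-frequency fingerprint via NLTK.
    import nltk

    def fingerprint(text, n=20):
        words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
        freq = nltk.FreqDist(words)
        # Sort so the fingerprint ignores ordering among the top terms.
        return hash(tuple(sorted(w for w, _ in freq.most_common(n))))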

~~~
RiderOfGiraffes
Given that so many sites use some form of active content I should think that,
unless designed very, very carefully, two separate loads of the same page
could easily produce different "fingerprints." I've considered that and
rejected it as too hard for a five minute trial, whereas the test against the
title seems to work very well.

------
swolchok
Hmm. What evil could you wreak with a service for making the HN server connect
to whatever website you want?

~~~
RiderOfGiraffes
Doing a curl, parsing the result for the <title> field, then saving the string
for comparison if the title field exists seems pretty safe. I'd be interested
to see an exploit.

And as it stands, allowing the submission at all already exposes everyone who
clicks it; a fetcher written to be tolerant should prevent any such exploit.
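The obvious hardening is cheap; a sketch, with arbitrary limits:

    # Sketch: fetch defensively before looking for <title>. The limits
    # are assumptions; the point is to bound time and memory. A real
    # deployment would also refuse private/internal IP ranges.
    import requests

    def fetch_page_head(url, max_bytes=65536):
        with requests.get(url, timeout=5, stream=True) as resp:
            if "html" not in resp.headers.get("Content-Type", ""):
                return None  # don't parse PDFs, images, etc.
            chunk = next(resp.iter_content(chunk_size=max_bytes), b"")
            return chunk.decode(resp.encoding or "utf-8", errors="replace")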

Still, an interesting question.

------
andreyf
Are dups that big of a problem?

~~~
RiderOfGiraffes
They certainly occur a lot, and I personally find them annoying:

<http://searchyc.com/Dup?sort=by_date>

The system already culls many dups by comparing the URL of a new submission
with those of earlier submissions - it would be interesting to know how many.
If you try to resubmit something, it counts as an upvote for the original.

Try it: pick something already submitted, then try to submit the exact same
URL. I don't know how often that check gets invoked, but its simplistic logic
misses many dups.
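As a sketch, that exact-match behavior presumably amounts to something like this (the data structures are illustrative guesses, not HN's actual code):

    # Sketch: exact-URL dedup where a resubmission becomes an upvote.
    submissions = {}  # exact url -> item

    def submit(url, user, title):
        if url in submissions:
            item = submissions[url]
            item["votes"].add(user)  # resubmission counts as an upvote
            return item
        item = {"url": url, "title": title, "votes": {user}}
        submissions[url] = item
        return item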

Of course, just because it happens a lot (for some value of "a lot") that
doesn't automatically make it a problem.

