Ask HN: Better duplicate identification
17 points by RiderOfGiraffes 2855 days ago | 17 comments
What if each submission caused the HN engine to download the title from the referenced page? It's exceedingly unlikely that two different useful pages will share the same HTML title, but whenever there's a duplicate submission, the title should be the same.

It's previously been suggested that duplications could/should be detected by looking at the page source, but surely it would be enough 99.9% of the time just to check for duplication of the <title> contents.
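
As a rough illustration of the title check, here is a minimal Python sketch using only the standard library. The names `TitleParser`, `extract_title` and `same_title` are made up for this example, and the whitespace and case normalization is an assumption rather than anything HN is known to do.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first <title> counts.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed and case folded."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split()).casefold()


def same_title(html_a, html_b):
    """True when both pages have the same non-empty normalized title."""
    title_a = extract_title(html_a)
    return bool(title_a) and title_a == extract_title(html_b)
```

Pages without a title never match each other here, so an empty title is not treated as evidence of duplication.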

Thoughts? Problems? Counter-suggestions?

The NYT, Huff Post, and others do LOTS of experiments with title and click through rate measurements. It is very likely that a submitted page will have a different title later.

I didn't know that. I would hope that by the time the title gets changed the potential contributor would have seen the earlier contribution.

Mind you, the evidence is against me there ...

Still, perhaps most sites have an unchanging title, so perhaps it will still work enough.

I think if you make it an initiative for the submitters to submit the lowest common denominator, we would rid ourselves of the duplications. I have to believe everyone on Hacker News knows how URLs work and can see that something like this:

is not needed as part of the submission URL.

And to cover all bases, if you don't know whether it is needed, try it before you submit it to HN. Delete the above portion of the URL, hit enter, and if you still get the same webpage you have done your due diligence. Finding a correct algorithm for this might be tough, but if we put the pressure on the submitters themselves we might get something done.

Well, case in point:



The linked-to items have the same title, but the URLs are different because of the extra, unnecessary parameters. The extra check would have prevented the duplication.

But if you put the onus on the submitter, it won't change. After all, two of the recent duplications were duplicates of items on the front page, not even obscure ones.

There's no penalty for people who don't bother. Automated checking goes a long way. The only way you'll get people to do the checking before they submit is if there's a penalty. Perhaps duplicated items should cause a loss of karma. Pretty draconian when we want good items. People might stop bothering.

Checking the title seems easy to implement and would gain a lot; then we can see whether the problem remains.

And I think this is a problem, because sometimes the discussion can get diluted across multiple submissions, leading to duplication, which is a waste of time and effort. As a hacker, I resent that.

Maybe borrow a feature from Stack Overflow, which presents similar-sounding stories before posting? Some client-side JS which pulls the page, heuristically picks some "interesting" words, and runs them through search.yc?

Makes sense to me. It also seems that a good part of the dupes come from the fact that the full URL is used as the identifier. And often times that URL has a lot of referrer garbage and similar data in it. It would seem there could be some logic to try and pare URLs down somewhat in the comparator code.

For some pages the "stuff" that follows the obvious root URL is in fact required. There are no easy rules for paring down the URL - that would need to be done by hand, and most people who submit items don't take the time to do that.

Sometimes it IS certainly required, but many times there are extraneous switches tacked on, which follow standard URL query-string patterns.

Take this URL for example: http://www.wired.com/thisdayintech/2009/12/1223Shockley-Bard...

The question mark is the demarcation point between the story URL and the "garbage". While some cases may exist, I don't recall recently seeing a URL of the format: http://mydomain.com/foo/storyFetcherScript?storyID=TheOneAbo...

Even so, one possibility would be to compare URLs up to the first question mark, or other obvious demarcation characters, and if a dupe is found ask the submitter to verify their submission is unique. Like most other things at HN, I think that for the time being an honor system yes/no question is all that would really be needed to weed out 95% of the dupe submissions.
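
A minimal sketch of that comparison in Python, assuming the query string and fragment are the only parts dropped. The names `url_key` and `possible_dupe` are hypothetical, and the example.com URLs in the usage note are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def url_key(url):
    """Reduce a URL to scheme, host and path, dropping '?query' and '#fragment'."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()            # hostnames are case-insensitive
    path = parts.path.rstrip("/") or "/"   # treat '/story' and '/story/' alike
    return urlunsplit((parts.scheme.lower(), host, path, "", ""))


def possible_dupe(new_url, seen_keys):
    """True when an earlier submission reduces to the same key."""
    return url_key(new_url) in seen_keys
```

For instance, `url_key("http://Example.com/story/123?utm_source=rss#top")` gives `"http://example.com/story/123"`. A site that keeps the story ID in the query string would collapse to a single key under this scheme, which is why a match should only prompt the yes/no question rather than reject the submission outright.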

It should also help supplement the spam filtering. Taking a quick look at /noobstories, a bunch of these would die solely from this. Which should in turn make the /newest page somewhat better, which can only be a good thing.

Maybe a good way to identify duplicates could be a comparison of word frequencies (or frequencies of 3 word sentences or something like that)? Perhaps that way one could eliminate the "layout" and identify the core text.
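
One hedged way to try the "3 word sentences" idea is to compare sets of overlapping three-word sequences (often called shingles) using Jaccard overlap. This is a stand-in technique chosen for illustration, not something the thread settles on, and `shingles` and `jaccard` are made-up names. Stripping the page layout down to plain text is assumed to happen beforehand.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(text_a, text_b, n=3):
    """Shared shingles divided by all shingles; 0.0 if either text is too short."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Two copies of the same article wrapped in different navigation text should score close to 1.0, while unrelated pages should score close to 0.0. Where to put the cut-off would have to be found by experiment.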

I would like to create a library for that, if I could find the time :-/ Maybe then sites like HN would have an easier time including the functionality.

Not sure if counting frequencies would be enough to identify the duplicates that have been filtered through translation. Interesting stuff.

Natural Language Toolkit for Python: http://www.nltk.org/Home

It wouldn't be hard to use that to come up with a quick "fingerprint" of a page, consisting of word frequencies. The chances of two different pages having the same fingerprint would be exceedingly small.

Given that so many sites use some form of active content I should think that, unless designed very, very carefully, two separate loads of the same page could easily produce different "fingerprints." I've considered that and rejected it as too hard for a five minute trial, whereas the test against the title seems to work very well.

I think there has to be some filtering out of irrelevant words, though. Rather than a unique fingerprint, there would have to be a measure of similarity.
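
A small sketch of what such a similarity measure might look like, using cosine similarity over word counts in plain Python rather than NLTK. The `STOPWORDS` set is a tiny illustrative list, not a real stop-word corpus, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative only; a real list of "irrelevant" words would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "it", "that", "for", "on", "with"}


def word_counts(text):
    """Count the words in the text, skipping the stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two word-count vectors (0.0 to 1.0)."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Because the result is a score rather than an exact match, two loads of the same page with slightly different active content would still come out close to each other instead of producing unrelated fingerprints.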

But I'll have a look at the NLTK, thanks for the link!

Hmm. What evil could you wreak with a service for making the HN server connect to whatever website you want?

Doing a curl, parsing the return for the <title> field, then saving the string for comparison if the title field exists seems pretty safe. I'd be interested to see an exploit.
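
For what it's worth, a few guards seem cheap to add to the fetch. The sketch below is an assumption about what a cautious fetcher could do, not a description of HN's code: it only allows http and https, refuses literal IP addresses that are not publicly routable, and caps how much it reads. `MAX_BYTES`, `allowed_target` and `fetch_head_of_page` are hypothetical names.

```python
import ipaddress
from urllib.parse import urlsplit
from urllib.request import urlopen

MAX_BYTES = 64 * 1024  # assumed cap; the <title> normally appears early in the page


def allowed_target(url):
    """Allow only http(s) URLs whose host is not a non-public literal IP."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parts.hostname)
    except ValueError:
        # A hostname, not an IP literal. A fuller check would also vet the
        # address the name resolves to; this sketch does not.
        return True
    return ip.is_global


def fetch_head_of_page(url, timeout=5):
    """Fetch at most MAX_BYTES of the page, refusing disallowed targets."""
    if not allowed_target(url):
        raise ValueError("refusing to fetch " + url)
    # Note: urlopen follows redirects, which this sketch does not re-check.
    with urlopen(url, timeout=timeout) as resp:
        return resp.read(MAX_BYTES).decode("utf-8", errors="replace")
```

The timeout and byte cap are there so that a slow or endless response cannot tie up the server for long.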

And as it stands, allowing the submission at all already endangers us all; better to write something designed to be tolerant that prevents any such exploit.

Still, an interesting question.

Are dups that big of a problem?

They certainly occur a lot, and I personally find them annoying:


The system already culls many dups (it would be interesting to know how many) by comparing the URL of a new submission with those of earlier ones. If you try to resubmit something, it counts as an upvote for the original.

Try it. Pick something already submitted, then try to submit the exact same URL. I don't know how often it gets invoked, but the simplistic logic misses many dups.

Of course, just because it happens a lot (for some value of "a lot") that doesn't automatically make it a problem.
