
De-duplicating Hacker News - shishir456
http://shishirprasad.com/de-duplicating-hacker-news/
======
ISL
De-duplicating might have a couple of downsides:

1) Not all good stories get taken up the first time.

2) Not everyone reads every story on HN.

After a couple of years of reading HN, I'm happy to see quality posts
reappear. I often glean new insights each time. If it's interesting to enough
people, it'll bubble up again. The beauty of HN's "gravity" is that anything
that's universally boring will disappear quickly.

"Anything that gratifies one's intellectual curiosity" needn't mean, "Hasn't
been posted ever before."

If, however, a de-duplicator could automatically provide references to all
previous HN discussions on the same topic, that'd be very cool.

~~~
_asummers
One place a tool like this would be useful is on submission. If I post a link
and something comes up that says "HEY! This was last posted 2 days ago and got
300 upvotes and had 45 comments, here's the link.", that would discourage
reposting when the poster didn't know the article had been posted before.

~~~
eli
I thought it already did that.

~~~
_asummers
Oh, maybe. I guess I've never submitted a link before, so I wouldn't know. Even
so, if that algorithm can be improved, even slightly, it'd be better for
curbing repeat posts.

------
avinassh
> Compute pair-wise Jaccard similarity of all articles with each other and
> output the articles whose title similarity is greater than 0.5.

If I understand correctly, it checks the similarity between titles. How
effective would it be if it also checked similarity between article contents?
Sometimes articles don't have similar-looking titles but are talking about the
same thing.

Example: [0], [1], [2]

OT: You can use haxor, a Python library, to get articles from HN [3].
Disclaimer: I wrote it.

[0] - [http://www.bloomberg.com/news/articles/2015-07-17/google-app...](http://www.bloomberg.com/news/articles/2015-07-17/google-appreciation-day-as-record-60-billion-is-added-to-stock)

[1] - [http://www.cnbc.com/2015/07/17/googles-one-day-rally-is-the-...](http://www.cnbc.com/2015/07/17/googles-one-day-rally-is-the-biggest-in-history.html)

[2] - [http://www.bbc.com/news/business-33572959](http://www.bbc.com/news/business-33572959)

[3] - [http://github.com/avinassh/haxor](http://github.com/avinassh/haxor)
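
The quoted pairwise-title comparison might be sketched like this (a minimal
sketch of Jaccard similarity over word sets; the 0.5 threshold comes from the
quote, but the tokenization and function names are my assumptions, not the
author's actual code):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two titles over lowercased word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def likely_dupes(titles, threshold=0.5):
    """All-pairs comparison; fine for one day's worth of stories."""
    return [(a, b) for a, b in combinations(titles, 2)
            if jaccard(a, b) > threshold]
```

Comparing article bodies instead would mean feeding body text to the same
function, though shingles (word n-grams) usually work better than bags of
words at that length.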

------
brudgers
My understanding of the article is that this is more about reducing
duplication on breaking stories rather than preventing periodic reposting of
long tail and evergreen content. That is, the proposal appears focused on the
"news" end of the spectrum rather than the "feature" end.

It's a reasonable request but one I tend to disagree with. Funneling all
attention onto one version of a breaking story reduces nuance and diversity of
perspective. It also rewards the fastest posting over finding the best
content, while taking the community out of the calculus of story value. And to
me the dispersion of karma awards across multiple submissions is a feature,
not a bug: where an explicit mechanism is in operation, karma should tend
toward rewarding quality rather than timing.

Finally, there are occasions where a story is deemed to merit the full
attention of the community. The deaths of Steve Jobs and Dennis Ritchie are
examples.

------
Houshalter
I think this is a bad idea. Last week I posted a link here and by random
chance no one saw it, and it didn't get any points.

A few days later the same story ended up at the top of /r/machinelearning. At
least 10 different people tried to post it. They all ended up at my post which
was days old and dead. If HN just allowed resubmissions, it probably would
have ended up on the front page that day.

~~~
moridin007
How do you know that?

~~~
Houshalter
Posts with >5 upvotes tend to go to the front page for at least a few minutes.
More than that many people tried to repost it.

~~~
dang
What was the post? I'd be curious to look into it, if you don't mind saying.
(Edit: might be best to email hn@ycombinator.com.)

~~~
Houshalter
It was
[https://news.ycombinator.com/item?id=9856637](https://news.ycombinator.com/item?id=9856637)

~~~
dang
Ok, there was another factor. When a site is the source of many stories marked
lightweight, it eventually gets penalized as a lightweight site. We have
software that does this, plus moderators do it.

This was the case here: that site has been the source of, not spam exactly
(which is why it isn't banned), but a lot of unsubstantive and/or derivative
articles. The post you submitted is an example of the latter, since it was
derived from a Reddit thread.

The penalties I'm talking about don't make it impossible for a story to get
traction, but they do set the bar higher. So you were right that it was
randomness, but the randomness was also skewed.

------
kitd
Looks like the site's been hacked now.

------
ChuckMcM
Interesting approach. A slightly simpler one is to just take the MD5 hash of
each paragraph. Two paragraphs with the same hash are likely identical, and
two articles with 2 or more identical paragraphs are likely dupes.

So, as a suggestion, try that algorithm with your current infrastructure and
let us know how it compares to the Jaccard similarity test.
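
A rough sketch of that suggestion, assuming each article is already split into
a list of paragraph strings (the function names and normalization are mine,
not the commenter's):

```python
import hashlib

def paragraph_hashes(paragraphs):
    """MD5 of each normalized paragraph; equal hashes => identical text."""
    return {hashlib.md5(p.strip().lower().encode("utf-8")).hexdigest()
            for p in paragraphs}

def likely_dupe(article_a, article_b, min_shared=2):
    """Two or more identical paragraphs => likely duplicate articles."""
    shared = paragraph_hashes(article_a) & paragraph_hashes(article_b)
    return len(shared) >= min_shared
```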

~~~
gus_massa
Some blogs have a standard closing paragraph like "If you have read all of
this, you may like to subscribe to my RSS feed" or "We are always hiring at
ABC, send your resume." Another problem is short captions that look like a
paragraph to the HTML parser, like "Advertisement" or "XYZ Benchmark (higher
is better)". One possible solution is to skip paragraphs that have fewer than
~150 letters.

~~~
ChuckMcM
I agree; it is quite reasonable to ignore paragraphs that are shorter than 3
sentences.
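
The boilerplate problem above could be handled with a simple filter before
hashing (the 150-letter cutoff is the ballpark gus_massa suggests; a
sentence-count check would look much the same):

```python
def substantial(paragraphs, min_letters=150):
    """Drop captions, ad labels, and boilerplate sign-offs by letter count."""
    return [p for p in paragraphs
            if sum(ch.isalpha() for ch in p) >= min_letters]
```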

------
protomyth
Nice, but do remember that dang has asked people to resubmit some links. That
introduces an interesting variable: did the initial link actually gain
traction before it sank into obscurity without accruing comments?

------
zeckalpha
Duplicate content is de-duplicated to an extent. Oftentimes, if I forget where
the HN story that corresponds to a page I'm reading is, I resubmit it, and HN
redirects me to the existing story.

I believe this is time-boxed to a few weeks, which allows things like
"Something interesting from the past (2008)" to get posted again.

------
tmalsburg2
If duplicates bother you, this is a sure sign you're spending too much time on
HN.

------
jbi
The problems mentioned can become real. Other simple solutions could be:

\- Allow merging threads via special comments. Once a comment with a merge
request gets 'a lot of' upvotes / more than the corresponding thread, the
threads get merged.

\- Add an 'also on HN' block listing all threads linked inside the comments.

\- Allow creating compositions in the submit function. This could also create
potential for meta-HN content, e.g. 'links with great discussions'.

IMHO: HN is nice because it does not have a lot of functionality. It just
works. Complicating things isn't a solution. It's human-driven, in contrast to
an automatic news aggregator, and it should stay human-driven.

------
linhchi
I'd just hope that duplicate threads get merged together so I can see all the
comments, since comments can't be duplicated.

I appreciate the members who solve the duplicate problem manually by pasting
links to the same post. Maybe there could be a "Report as duplicate" option so
that admins can merge them.

------
whelchel
When it comes to changing things on HN, I think something like comment
collapsing would be much more helpful than de-dup (as some of the other
comments mention), though I know there are extensions that help out.

~~~
krapp
That is, apparently, eventually, on the way, as is a mobile-friendly layout.

The staff are probably not looking forward to the day they deploy it and half
the users lose their minds with impotent rage because the shibboleth of
awkward UX is no longer there to drive away the casuals. Heaven help us if
they implement thread folding in JavaScript...

~~~
DanBC
An optional [https://m.news.ycombinator.com/](https://m.news.ycombinator.com/)
with bigger font, and bigger vote buttons with better separation between them,
would allow them to test how many users want a different UI.

------
zobzu
One issue I see is not the dupes, but that it's decently easy to self-vote
from several "accounts" and "IPs" to stay on the front page for a little
while.

The other is that, with the fast pace of news, only a few stories per day are
really interesting, and if you don't look every hour you might miss them
(which is either way not very productive).

------
sneak
It ain't broke.

------
kaushikfrnd
The URL seems broken, can you check?

~~~
shishir456
Yeah, seems to have been hacked or something :(. Let me look into it! The
parent site works and you can read the article there:
[http://shishirprasad.com/](http://shishirprasad.com/)

------
crystalgiver
Why can you not just use the canonicalized URL to detect dupes? That is
infinitely simpler than doing text analysis.

~~~
shishir456
It will work for simple cases like https vs. http or other URL normalization,
but it won't work for complex cases where different pages refer to the same
content under different titles.

~~~
eridal
I think it could work with the canonical tag[1], not the url itself.

[1] [http://googlewebmastercentral.blogspot.com.ar/2009/02/specif...](http://googlewebmastercentral.blogspot.com.ar/2009/02/specify-your-canonical.html)
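
Reading the canonical tag with Python's standard-library parser might look
something like this (a sketch; real-world fetching, charsets, and error
handling are omitted):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of a <link rel="canonical" href="..."> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if d.get("rel", "").lower() == "canonical":
                self.canonical = d.get("href")

def canonical_url(html: str):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Two submissions whose pages declare the same canonical URL could then be
treated as the same story, regardless of the submitted URL or title.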

