

Tell HN: After six days of DupDetector, time to take stock. - DupDetector

I've been running DupDetector for just under a week now.  The results have been interesting.  Many, many duplicates have been found, and many, many more items have been cross-referenced.  So often the same story is reported again and again from different sources, and while this isn't a bad thing in itself, my personal belief is that the resulting divided discussions are a waste of time and effort.  And wasted time is something I hate.<p>I come to HN for high-quality links, and even more for high-quality discussion, and anything that dilutes that discussion is, to me, a bad thing.<p>I originally intended to run the DupDetector for a week, but after 5 1/2 days there's enough information to tell me what I want.  The thing that has caught me by surprise is the way it has had such widely differing reactions.  Some people have accused me of running "a novelty account", something I'd never heard of, but which appears to be associated with Reddit.  I thought of it more as a robot assistant.<p>In particular, I thought more people would be interested in the technology and the hacking.  More than anything it's the lack of a response on that level that's made me pause.<p>And the final factor is today, when DupDetector's karma has fallen from 27 to 12.  I don't care about the karma, but it's an indication of people's feelings about the exercise.<p>I'll run it for just a few hours longer as I tidy things up, but basically I'm stopping, explaining, and I'll see what the response is.  I thought it would be, and it would be thought to be, cool, interesting and useful.<p>Maybe I've misjudged my audience.  I am reviewing the situation.
======
blhack
This is something that I seem to disagree with a lot of people on. Dupe-police
have existed on reddit, digg, slashdot, here, and just about every other
social bookmarking website I've ever used (including my own).

I _hate_ them. Why? Because the entire point of social bookmarking is to find
things you find interesting, _not_ things that you find unique. That little
arrow to the left of the title means "I found this link interesting. I think
other people will find it interesting as well."

It absolutely does not mean "This link is unique. Nobody has seen it before."
If that were the point, we could just pipe RSS feeds into the URL submitter,
couldn't we?

The very fact that links are appearing on the front page means that a lot of
people haven't seen them yet. It means that they got some utility out of
reading them, and it means that they thought others would too.

Sometimes, dupes are good. I don't remember who said it, but about that Louis
CK interview that gets posted every once in a while, where he talks about how
we're surrounded by wonderful technology and yet nobody cares, somebody said
something to the effect of "I wouldn't care if this was stickied to the top of
the page and everybody had to watch it every single day before they post. He
is making an excellent point."

Now, I think this is a _bit_ excessive, but the point stands. It isn't about
being unique, it is about being _good_.

~~~
ugh
Referencing past discussions is useful, though. That’s where I would see the
niche for this bot.

~~~
endgame
The difference between something like DupDetector and someone like MrOhHai on
Reddit is that DupDetector was polite and was more like "here's the earlier
discussion" as opposed to "you should not have posted this".

~~~
ugh
It’s still an understandably sensitive topic, hence my recommendation to make
DupDetector much more polite and friendly so that there can be no
misunderstanding about the intent of the bot.

~~~
endgame
That's an excellent idea.

------
pclark
I found your bot really annoying. I don't need someone telling me there is
another submission with no points and no comments.

It almost felt petty, as if you were scolding the submitter for submitting
content that had already been submitted. The comments section of Hacker News
is one of the last places on the web that has not been hammered with noise.

I would be somewhat interested in how you solved this problem automatically
though.

------
ugh
I think a bot which references submissions that already have comments would be
incredibly useful, especially on duplicate submissions that don’t have
comments but received a few upvotes. I don’t think you should reference
duplicate submissions without comments, that just seems pointless. I don’t
think that dupes by themselves are bad, but stumbling across a story which was
already richly discussed on HN in the past without being able to find that
discussion is certainly not optimal.

I also think you should work on the bot’s politeness. It’s easy to perceive
the waltzing in of a bot that says nothing but “This submission has ended up
with the points and comments” and a link as rude. Technically, sure, that
sentence is not negative, or at least not overtly so, but it is easy to
perceive it as such.

I would formulate the bot’s phrases in a consciously positive way, showing
that your intent is not to be smug about submitters of dupes but that you just
want to help, you just want to provide a service.

~~~
icey
I wonder if it would have gotten a better reception if it was
"RelatedStoriesBot" instead of "DupDetector".

I used it as a way to find more comments on a topic, and liked it pretty well
for that. Of course, I liked it when RiderOfGiraffes was posting them as well;
although I wondered how he was able to actually read any of the stories
because he seemed to be everywhere with dupe reports.

~~~
RiderOfGiraffes
This has turned into a brain dump - apologies.

I had hoped that the DupDetector would give me more time to read stories, but
I found two things:

1. There were many, many more dupes being found than I expected. Checking the
robot's output before confirming the posting took about the same amount of
time as finding fewer dupes by hand used to. Fully automating it would only
take a little more work, and maybe I'd then get the time back.

2. I was reading a bit more, but I found that the additional material wasn't
interesting. I was probably already reading everything I found useful,
interesting, instructive, or engaging. I'm working on a script to help me with
that as well.

The problem I'm finding is the sheer volume, and most of it is repeats,
politics, repeats, TSA, repeats, wikileaks, repeats, Assange, repeats, _etc._
The proportion of material with deep technical content is much less than I
remember. Consider: as I write this, anything more than 40 minutes old has
already fallen off the "newest" page. At that pace it can't all be worth
reading.

It's suggested that newcomers read the "news" page, and perhaps the "over"
page so they get enculturated with what this site is about. Similarly, it's
suggested that older hands inhabit the "newest" page so we can vote up those
things that deserve it, and flag the inappropriate.

But I can't keep up with "newest" any more, and much of what I would find
interesting is vanishing before I can find it. Searching deeper will find it
sometimes, but the 'bots were intended to help.

So I'm working on trying to help, working on trying to find the good stuff (by
some definition), and working on adding value.

So, for what it's worth, that's what I think. I hope it sparks off some
interesting or useful thoughts.

~~~
J3L2404
I like the bot (I thought it was you) and I think it would work if the OP
could then delete the post. It would be up to them, as it probably should be,
because, as the PG Essay kerfuffle earlier showed, sometimes posts should be
seen again. I hope you have time to give some details on the bot itself. In my
opinion DupDetector should live on.

------
kenjackson
I saw the comments but didn't realize it was automated.

I think it would be more useful if integrated with HN at the submission phase.
Seeing that something is a dupe of something else after it's already on the
front page seems too late.

------
blahedo
In a semi-related vein, let me reference my own Ask HN from last week:
<http://news.ycombinator.com/item?id=1975950>

My claim was that HN should bake in a "dup" button that let users flag
duplicate posts, _not_ with the goal of removing dups, but with the goal of
cross-referencing them (and possibly sharing karma or some other idea). I
won't recap the whole thing here, but I find DupDetector an interesting
complement to those ideas.

------
momotomo
Ok, Reddit handles this really well in their submission system.

If you post a link, and it exists, it simply flags: here's all the posts that
already have this link. Do you still want to proceed?

It gives you the choice of jumping into the existing discussions or re-posting
(e.g., if the old submissions are years out of date).

~~~
RiderOfGiraffes
A similar thing happens here if you post an identical link. Your submission is
disallowed and acts as an up-vote for the original.

But too many URLs are different, sometimes subtly, sometimes not so. I've seen
submissions with just an extra hash on the end. There are the submissions with
all the feedburner crap cluttering it up, and so on. This was an attempt to be
more thorough about detecting duplicates, doing "more properly" a job that's
already done, and therefore presumably desirable.

~~~
momotomo
Ah! That's good, didn't realise. The upvote is a good idea, but yeah, there's
a million different ways to submit the same online content.

The function is definitely desirable, but it seems better suited to a browser
plugin or the like. Running it as a bot that contributes comments makes it
all but mandatory for every user, versus being an optional component.

It's the root of the use case here - the functionality is great, but not all
users find it desirable.

------
johkra
I first saw it today and I was about to thank the administration for this new
feature when I realized that it was a user's (or bot's) comment.

The functionality is imho highly desirable, but a more dynamic solution
(regularly updating the number of comments, for instance) would be a better
implementation.

------
chipsy
I agree with the people saying "market it better." As it is, DupDetector looks
like linkspam and is thus ripe for angry downvotes.

Change the focus from strict dupe-finding to "add additional context to the
article people are already looking at." Copy some data about the comments,
note the age of the previous discussions, even reproduce the highest-rated
comment. These things will give it a more positive/helpful image.

------
revorad
I didn't realise it was a bot! It would be awesome if you share your code and
we can get PG to incorporate it in the site somehow.

------
trickjarrett
I had no issue with you running it. I think the negative responses came from
people reacting out of fear of letting "novelty" accounts get going, with this
as the beachhead where they had to make their stand.

~~~
shib71
Twitter is a good example of what can happen when "meta" accounts get out of
hand. I think HNers find them fascinating in theory, but feel that they dilute
the community in practice.

------
6ren
A near-simultaneous dup _does_ divide discussion. It would be better if only
one was allowed. Quite often, breaking news is divided among two or three
posts.

Dups that are distant in time are more complex. One aspect is that people
might not have seen the previous story - so it would be good to show it again.
Another aspect is that the previous discussion is lost, leading to the same
points being repeated, instead of (possibly) being built upon - so it would be
better to _resurrect_ the previous submission and therefore discussion.

One solution is to detect and combine dups, but enable them to launch the
story fresh, _if_ sufficient time has passed.

There is in fact already a discrete implementation of this: stories over a
year old (I think that's the period) can be resubmitted as a new story. So I'm
suggesting a continuous version of this idea, where the "newness" of a story
gradually increases, until it becomes completely new after a year. "Newness"
could be implemented with a factor on the story-score. This would enable old
stories (and their discussions) to return to the front-page.

------
anthonycerra
I think the idea is really interesting and is itself very HN. The thing
getting you down is the way people misuse the down vote.

Disagreeing with something is no reason to down vote a comment. I'm only at
~100 karma, but I'm not waiting to get to 500 just so I can down vote people I
disagree with. Disagreement spurs discussion. It'd be pretty boring if
everyone agreed on everything.

Since you have to be "qualified" to use the down vote, it should be reserved
for instances where the user is not being a respectful fellow hacker.

All that being said, maybe you can include a line saying "Just a bot doing
research" before it lists the duplicates.

------
oomkiller
It would have been fine if you ran the experiment, WITHOUT polluting the
comments with useless dupe links. You could then write up a blog post or
something showing what you had found.

~~~
tvon
The dupe links seemed useful to me. If something has already been discussed at
length, I'd like to know about it.

~~~
oomkiller
That's a fair point, but still, the posts were made by a bot, which can't
decide whether there is useful discussion on another page or not.

~~~
sp332
DupDetector _can_ tell if there's a discussion by the number of comments, and
it uses points to determine the "usefulness" of the discussion. You can find
examples here: <http://news.ycombinator.com/threads?id=DupDetector>
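A guess at the shape of that rule, for concreteness - the actual thresholds DupDetector used aren't stated in the thread, so the cut-offs below are placeholders:

```python
def worth_linking(points: int, comments: int,
                  min_points: int = 5, min_comments: int = 3) -> bool:
    """Only reference an earlier submission if it drew real discussion
    (enough comments) or real interest (enough upvotes)."""
    return comments >= min_comments or points >= min_points
```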

------
aeurielesn
> _there's enough information to tell me what I want_

Then, what is it?

I think you have not been clear enough about _what_ you achieved.

~~~
krschultz
In 3 days, wikileaks will expose what the dupdetector really wants.

------
vessenes
I agree that there's just a slight miss re: HN style and community behavior,
and also think that it's a beneficial service. Does the robot update the
comments automatically?

I wonder what people would make of you using your 'real' account with a note
that it's a robot posting. On some sites this would be considered karma
whoring, but maybe here it would get a better response?

------
pmichaud
I was watching this with interest, actually. I think it would be great to
somehow tie threads together. blhack is right when he says that social
bookmarking is about finding things interesting, not unique, so dupes aren't
evil in themselves. It's the dilution of discussion that's the issue. So if
you can solve that issue elegantly, you win.

------
coderdude
What method are you using to conclude that two stories on two different sites
are about the same topic? That's always been a feature of certain sites that
captured my interest, but it seems like so much can go wrong. Achieving decent
accuracy must be very difficult. Have you written about this anywhere?

------
JoshCole
I actually was really excited when I saw this originally. I made a point of
going through all the bots posts to see what you were doing with it. I got the
impression that you were trying to tweak the bot to the point that people
would find it interesting. So for me it is sad to hear that things aren't
working out. I was hoping that DupDetector would eventually be pointing to
year old discussion pages; a discoverer instead of a detector. It was only
links to the articles that didn't gain any traction that bothered me. They
seemed like noise.

------
zoomzoom
When I first saw the dup detector working, I thought that this was an
interesting problem, and one that I want to see solved. I have often posted a
comment on the losing thread, only to see it surpassed by the next post.

In the sense of an interesting hack, it was fun and pragmatic. Thanks for
doing it. But it seems that the community doesn't see the need for the
service.

------
te_platt
Simply stating "More comments here..." and then showing the additional
submissions might get a better response. HN isn't so much a discussion among a
few friends
(although it feels that way at times) as much as large public park where
people come in and out at all different times. The fact that topics repeat (at
least some) isn't such a bad thing.

------
lazugod
There were cases where the linked duplicates had only one or two comments. It
was annoying to see them listed - not because there is nothing to read, but
because it's a vote of no confidence that those particular posts would ever
have grown a larger discussion.

------
p_nathan
I personally really like the idea of the duplicate detector. I didn't realize
it was going on. I really don't want to see conversation split across threads;
I'd rather have one conversation per link.

my 0.02 c

------
pasbesoin
I saw a few posts from the account, but I didn't know any more.

I've been seeing an increasing number of duplicates myself, including entries
using the same URL with an octothorpe added at the end -- a pretty obvious
ploy, in my mind.

When HN started up, it was about the _conversation_. If someone else had
already posted a link, great: People just joined in the conversation there if
they had something to say. From my perspective, it saved the time of having to
post it. And, in that I was often learning as much or more from the
conversation on HN than from the links themselves, I was happy to find that
conversation focused in one thread.

I'm not sure how, but I'd like to see the site steered back in that direction,
if possible. (I have a few half-baked "ideas", but PG and crew have already
demonstrated themselves to be more insightful than me -- in my own mind.)

I do think identifying the HN member behind DupDetector might be a benefit, to
demonstrate their investment in the community and therefore, in my mind at
least, credibility. Yes, I see the email address now, and I half remember off
the top of my head whose domain that is. But I might have been a bit more
supportive if I knew who was behind it and that they had an established,
positive history with HN.

Anyway, just my 2¢; spend them before you need a wheelbarrow full.

------
steveklabnik
I really enjoyed this.

