Ask.YC: Spammers have finally hit News.YC -- what do we do about them?
30 points by pius on Feb 14, 2008 | hide | past | favorite | 66 comments
DNS Blackhole anyone? Here are the top 10 newest stories at the time of writing.

1. Hottest Artist Out Alicia Keys (aliciakeysvideos.net) 1 point by 1podgirl 4 minutes ago | discuss

2. Shrek 4 or the tv show? (eddiemurphyblog.net) 1 point by 1podgirl 5 minutes ago | discuss

3. American Gangster released on dvd (denzelwashingtonblog.com) 1 point by 1podgirl 6 minutes ago | discuss

4. Rihanna in a Video did you see it (rihannavideos.net) 1 point by 1podgirl 7 minutes ago | discuss

5. The true meaning of Testimony (testimonyblog.net) 1 point by 1podgirl 8 minutes ago | discuss

6. Urban Poetry (defpoetryjamvideos.net) 1 point by 1podgirl 9 minutes ago | discuss

7. For All The Treky Fans (startrekvideosclips.com) 1 point by 1podgirl 10 minutes ago | discuss

8. Updates on the New Star Wars (starwarsvideos.net) 1 point by 1podgirl 10 minutes ago | discuss

9. All Free Def Comedy Jam Videos (defcomedyjamvideos.com) 1 point by 1podgirl 11 minutes ago | discuss

10. Heres that site I told you about it should me how to master Madden (xbox360cheatsblog.com) 1 point by 1podgirl 13 minutes ago | discuss

*

user: 1podgirl created: 18 minutes ago karma: 1




The simplest technique would be for the YC News gods to grant us a "report SPAM" button. The site would then be self-moderating.

I would also suggest granting karma points for sniffing out SPAM, if the YC News gods agree that a reported link does indeed turn out to be spam.

Alternatively, a system of "open moderation" would be good as well (to lessen PG's load). In this case, moderation of the site would fall to a rotating schedule of users who have reached a certain level of karma points. Like jury duty, you have a queue of sites that have been flagged for review, and the member's job is merely to approve something as spam or not and have it dealt with. A link would have to be scrutinized by a random "committee" (ideally a random percentage of YC News members on a regular schedule), and a majority vote would collectively determine whether a link is spam or not.

---------------

Apologies for making this post so long....

1) have a "report" button

2) whoever is on "spam patrol" would see a reported link highlighted

3) "spam patrol" would have their "report" button substituted by a "confirm" button

4) after enough of the "spam patrol" confirms a link as spam, then the link is automatically marked for deletion

The advantage of such a system is that there would be some filtration of the links and spammers would (hopefully) be discouraged.
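The report/confirm flow described in steps 1-4 could be sketched roughly like this. All names and the confirmation threshold are invented for illustration; this is a minimal model, not an actual implementation:

```python
# Hypothetical sketch of the report/confirm spam-patrol flow.
# CONFIRMS_NEEDED and the Link class are illustrative inventions.

CONFIRMS_NEEDED = 3  # confirmations from spam patrol before deletion

class Link:
    def __init__(self, url):
        self.url = url
        self.reported = False   # step 1: any user hit "report"
        self.confirms = 0       # steps 2-3: patrol confirmations so far
        self.deleted = False    # step 4: marked for deletion

    def report(self):
        # A regular user flags the link; it is now highlighted for patrol.
        self.reported = True

    def confirm(self):
        # A patrol member's "confirm" button; enough confirms delete the link.
        if self.reported:
            self.confirms += 1
            if self.confirms >= CONFIRMS_NEEDED:
                self.deleted = True

link = Link("aliciakeysvideos.net")
link.report()
for _ in range(3):
    link.confirm()
print(link.deleted)  # True
```

A real system would also need to handle malicious reports, which is exactly the gaming question discussed further down.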

An "emergent" advantage would be to uncover spam sites. I can even imagine a number of social sites adopting a scheme creating a collaborative spam filter (ok, yes I'm spinning this out of control, but what the hell, I'm sure everybody here thinks like this!). This would be a social solution to a technical problem.

If reddit had such a system, then a collaboration between YC News and reddit would pre-emptively eliminate confirmed spam sites from one another. Using this example, a number of social sites could immediately benefit from an open database of spam filtering (akin to the use of OpenID). So a confirmed spam site on reddit would automatically translate into a confirmed spam site on YC News, or whoever else subscribes to such an open database.

Is this too complex of an idea???


It could be tricky to get right. For example, someone could censor a political blog from a social site interested in politics by spamming it to sites that are not interested in politics.

And, of course, if this kind of filtering gets popular, spammers can simply set up a separate website for each submission to a social site.


Awesome point about the political blog example. Perhaps attaching some sort of meaning to the filtering instead of a black/white SPAM vs. NOT-SPAM would be an added advantage of leveraging human judgement. So for example spam on a political oriented site would include news about celebs; or spam on gawker.com would equate to political links.

So the levels of filtration could be:

1) SPAM

2) Not relevant (to celeb/political/geek/etc nature of site)

3) <suggestions please.../>

The problem with this is the notion of "purity" between a collection of social news sites. Too much homogeneity (is that a word?) is problematic, as diversity is what brings value to social sites. I suppose how stringent the filtration should be could be an option when subscribing to such a database. There should also be a way to decentralize this, as centralization is not good (just my opinion...).

As far as spammers setting up separate websites for each social site....GOOD POINT!! A Bayesian filter that introspects different links in a database could help match different sites at the content level with some degree of confidence, thus exposing the level of sophistication of a specific spammer and identifying higher spam threats. Similar content could be matched between different databases (if decentralized), and such information could propagate faster between different types of social sites, identifying spammers by content.

Because human judgment would trickle down to a central location (how can this be decentralized, I wonder...), the timing of the appearance of sites in the database would possibly be significant (wouldn't it?), adding yet another insight into how spammers might operate (ok, perhaps I'm stretching it here).

I think the best way to look at it is to have a human filter (that depends on mass collaboration) up front and then later use a Bayesian filter to associate at the content level the different websites. Creating a web crawler to do this would be pretty trivial (I would imagine).
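As a much-simplified stand-in for the content-level matching described above: instead of a full Bayesian filter, a first pass could just compare the word sets of two pages with Jaccard similarity. The example pages and the comparison are invented for illustration:

```python
# Simplified sketch of content-level site matching: Jaccard similarity
# over word sets, as a crude proxy for the Bayesian matching proposed above.

def tokens(text):
    """Lowercased set of words in a page's text."""
    return set(text.lower().split())

def similarity(page_a, page_b):
    """Jaccard similarity: |intersection| / |union| of word sets."""
    a, b = tokens(page_a), tokens(page_b)
    return len(a & b) / len(a | b) if a | b else 0.0

known_spam = "free videos download free videos celebrity clips"
candidate  = "free celebrity videos download clips here"
legit      = "how to write a bayesian spam filter in lisp"

# The candidate page resembles known spam far more than the legit page does.
print(similarity(known_spam, candidate) > similarity(known_spam, legit))  # True
```

A real matcher would of course need stemming, n-grams, and probabilistic weighting to cope with the "content shifting" discussed below.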

------------------

How would spammers game such a system? One thing they can do is NOT report their own links as spam. This does not prevent others from doing so. Spammers could conversely report other sites as spam thus muddying the waters through volume. I can imagine a scenario where they would muddy the waters through their own sites, which would translate into merely blocking that site as a preventive measure (this is DEFINITELY fraught with problems that I have not thought of yet!!!). Similarly if they attempt to muddy the waters through a legit site, then there is the advantage of genuine human involvement and randomness.

This becomes an issue of membership to the database: what are the criteria to submit to the database and be a contributor to flagging spam sites? I suppose subscribing to such a database would be free for all.

A possible solution to centralized authority would be for everybody to contribute, but at the same time a site could choose to trust the quality of reporting from certain sites and exclude others. This could have yet another "emergent" quality of identifying collaborations of sites that wall off spammers trying to muddy the water: spammers can muddy the water as much as they want, but they won't be heard. This would eliminate the need described above for judging different sites; it could also give new sites a basis for which sites to trust for authority. Theoretically the head of a long tail of trust should emerge, and hopefully this could be a shifting mass of authority. This decentralization of authority would allow spammers to do their worst and still not have an effect (wouldn't it?).

But somehow I feel intuitively there might be a way to game the system...any thoughts???


Another easy way for spammers to game the system: Set up a link with a copy of a legitimate, interesting news from another location (perhaps from a social site with similar interests) and submit it to a social site. Wait a few hours while the users classify it as non-spam and comments start to roll in. Then, change the page to be mostly spam. (You can't fight this by taking hashes or diffing pages; most pages these days have dynamic content.)

Not only does this spread the spammer's message, but it also breaks down your proposed web of trust between social sites; sites that approve links that later turn out to be spam will be seen as untrustworthy.


Yes, this would break it temporarily. In the long run this strategy would lose. With enough eyeballs scrutinizing a site, such a system would eventually identify a site as spam and have it marked because of the dynamic nature of human judgement.

So let's say that out of 100 visitors, 20 see legitimate content, the next 20 see spam, thus alerting the system that they have encountered a spam site. The Bayesian filtering would be for sites with established spam content and then matching it against other established spam content. This would be a cost of 20 visitors exposed to spam while 20 others are exposed to legitimate content. The remaining 60 users are diverted. Sounds like a reasonable tradeoff to me...

Spammers could keep on shifting the content of sites, and this would make it more difficult to apply the spam filter for similar content (but humans, the first line of defense, aren't so easily fooled); this is where site moderators can identify users as potential spammers and give them the boot. So instead of having a 1:1 filter for domain:content, you could have a 1:n filter and then match that against other sites with similar 1:n "content shifting". A danger here is that legitimate content gets associated with spam and thus gets filtered out; this would require more sophistication on behalf of the Bayesian filter in associating content - but I'm certain this is a solved problem that can be put to use here.


It's no more complex than: add a CAPTCHA to the create-account screen.


I run another social news site (realestatevoices.com) and CAPTCHA doesn't help. The people inputting these links are 'SEO' firms based out of India.

Validating email accounts is worthwhile, though. But keep in mind these spammers are, in many cases, actual people, not automated.


interesting site...how long have you been up and running?


Not all spam is automated.


You are right; at the same time there is nothing novel about CAPTCHAs. I also have a habit of running wild with ideas, and I suppose it's beginning to show...

But hopefully my hatred for spam is showing through as well!


Spammers hit News.YC the day we launched. There is spam on more days than not. Usually the editors catch these before anyone sees them.


Is spamming much of a problem for the editors?

I suppose I should have asked this when you posted instead of going on my (useful?) brainstorm below (and I really was just throwing out some ideas)...


Not huge. There's not a lot of it, and we have several tools for dealing with it. It was unusual for those spams to sit for 3 hours on the new page without getting cleaned up.


Short term solution: Ban account 1podgirl.

I like the idea of prohibiting posting of articles until karma reaches a certain threshold, and limiting the amount of comments/hour by karma. So,

for karma < 10 comments/hour = 1

for karma >= 10 comments/hour = karma/10
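The rule above is a direct translation into code; the cutoff of 10 comes straight from the comment, and the function name is an invented label:

```python
# Karma-scaled comment rate limit, exactly as proposed above:
# below 10 karma you get 1 comment/hour; at or above 10, karma/10.

def comments_per_hour(karma):
    if karma < 10:
        return 1
    return karma // 10

print(comments_per_hour(5))   # 1
print(comments_per_hour(10))  # 1
print(comments_per_hour(55))  # 5
```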


Heaven help us when nickb goes sour and hits us with 15 spam articles a minute!


Muahahahaha... MUahahahaha. * wrings hands evilly *.


An alternative to a normal ban would be to put identified spam accounts into a sandbox mode, where it looks to the account holder like its submissions are working, but everybody else doesn't see them. That way it doesn't start seeing errors and think, "Oh, time to create another account."


I don't think you should force people to post comments. Just rate-limit their link submissions until they get 50 karma.


I agree...actually nobody should be allowed to submit 10 in an hour!


What about adding a hidden field to the submission form? If it contains a value upon submission, it's obviously spam filled in by a bot (if these are bots and not actual people doing the submitting). Ryan Grove uses this technique on his blog and said it eliminated a lot of spam:

"Yes. Riposte uses a very simple but very effective form of comment spam prevention: the comment form contains extra fields that are hidden from normal browsers via CSS but look like comment fields to spambots. When the form is submitted, if any of these fields contain content, Riposte denies the request.

This technique has been in use for over a year at wonko.com and has been effective at stopping virtually all automated comment spam attempts."


I also use this technique for SeekSift.com (I first saw it on Ned Batchelder's blog http://nedbatchelder.com/text/stopbots.html ), with good results (no spam bot signups so far).


These are likely human spam posts, since a user account is required to submit. (edit: reading the comments, maybe spam bots are getting more sophisticated than I thought)


What about posts costing you karma?

Maybe the post is 1 karma, and each link in the post is an additional karma point.

If it is a post of value, then you should earn at least more than 1 karma to replace it.

It makes your post an investment of karma. I think it's a good lesson for entrepreneurs too.


I think that is a really interesting idea. The only issue would be that if you hit 0 on your first post, you could never post again. So if you're at 0, you can post, but only once per day or something, until you post something that earns you karma.

It is a simple system but seems like it could be very effective. Kudos


Best suggestion I've heard so far.


The problem with this is that it's too easy for spammers to make new accounts for every post. Then all you do is limit the number of posts (which may or may not be a good thing) without doing much to deter spam.


I like this idea, it's very much like the #xkcd-signal experiment, designed to force interesting conversation (especially a problem in certain IRC channels...)

http://blag.xkcd.com/2008/01/14/robot9000-and-xkcd-signal-at...

It will also make non-spammers think twice about posting useless comments or submissions (assuming they either have low karma or actually care about their karma level)


OK, now that I think about it, maybe a chemical response is more effective.

What about having risk and a recovery time?

If you post, you can have karma at risk; if it gets a few negative responses, it can recover, but if it gets 3 or more, the karma is lost.

We all know what it means to give 3 strikes, so it only takes 3 of us to call it spam, while the post is risking karma for each link.

I feel like this has a lot in common with learning systems in how neurons sum signals and vote an outcome.


Good idea, but how about instead of posts costing you karma, spam costs you karma. Then if your karma goes negative, you cannot post until it becomes positive again. That way it doesn't penalize new users with low karma, but it will only take two bad posts to render a spam account ineffective.


Fundamentally, this sort of spam is about getting links to their sites (first for direct traffic and second for PageRank).

So, you should attack the links and not the accounts:

0. Check your URLs against blog spam blacklists. Exclude recently registered domains (all of the ones mentioned are less than one month old).

1. Links on stories that have become popular should be real hyperlinks.

2. Links on unpopular stories should go through a news.yc redirector (since this means they don't get PageRank)

3. Stories that go nowhere (likely spam) have their links removed and a text version of the link placed next to the story name for anyone who really wants to follow that link.

4. A 'spam' button could be handy.
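Points 1-3 above amount to a popularity-based rendering decision for each outbound link. The thresholds and helper name below are invented; this is just a sketch of the decision, not how News.YC actually renders links:

```python
# Sketch of link rendering by story popularity, per points 1-3 above:
# popular stories get real hyperlinks (and pass PageRank), unproven ones
# go through a redirector, and likely spam is de-linked to plain text.

def render_link(url, points):
    if points >= 10:   # popular: real hyperlink
        return '<a href="%s">link</a>' % url
    elif points >= 2:  # unproven: redirector, so it earns no PageRank
        return '<a href="/r?url=%s">link</a>' % url
    else:              # likely spam: plain text next to the story name
        return url

print(render_link("http://example.com", 1))   # http://example.com
```

The redirector step is the key economic attack: it denies the spammer PageRank even while the story is still visible.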


I was just about to submit an entry on this but you beat me to it :-)

If it persists it should definitely be addressed - one of the great things about this place is the high signal-to-noise ratio, and I would hate to see that drop.

But since PG basically invented Bayesian spam filtering, I am sure that it can be solved. If he isn't too busy doing Arc, that is. ;-)


Did he really? I thought he just proposed an improvement to it or something like that.


Wikipedia says that while other people worked on it earlier, PG's essay really popularized it:

http://en.wikipedia.org/wiki/Bayesian_spam_filtering


When I first heard the term "Bayesian spam filtering" and its description, my response was, "Oh. I thought that's how they always did it." I didn't realize it was a big innovation.


I think pg has something in place already, though it takes a little time to work: http://news.ycombinator.com/item?id=99975


Maybe this is a good time to start allowing negative karma and downvoting of submissions. If someone has negative karma, then they can't submit until they start making good comments at least.

I know pg is worried about mass downvoting, and I think a good way around that is that downvoting anything costs one karma point for the person doing the downvoting. That way you only downvote things that really deserve it... And if you're a chronic downvoter, you wouldn't be able to do it for long as you'd hit Karma 0.


The problem with down-voting is that there is no distinction between a spam report and judgmental calls on the content of a link. It changes the dynamic and the quality of links that make it to the top.

Having recently come from reddit (and I must say I absolutely LOVE the energy of YC News!!!), I noticed that the down-voting makes submission more competitive and it makes the site less varied and more homogeneous.


I disagree (to a degree). If enough people want to spend karma on a downvote, then maybe the link doesn't really belong in Hacker News anyway (I'm looking at you, XKCD posts).

If a submitter regularly submits garbage, how much different is it than spam? I mean, if a link is of low quality, maybe it ought to be downvoted & buried.


But the fact that other quality links can be liberally up-voted by others would prevent low quality links from rising. The only advantage I can see with the down-voting is that it will make (subjectively) quality links rise faster and poor links sink faster - but this is not necessarily an advantage, at least I don't think it is (I could be TOTALLY missing your point here!).


That will work only if the worthless submissions don't overpower us...


PermissionToSubmit = (Karma > x) * (AccountDaysOld > y)
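The one-liner above can be made runnable; the thresholds x and y are whatever the site operator picks (the values here are invented):

```python
# Runnable version of: PermissionToSubmit = (Karma > x) * (AccountDaysOld > y)
# Thresholds are illustrative.

X_KARMA, Y_DAYS = 20, 7

def permission_to_submit(karma, account_days_old):
    return karma > X_KARMA and account_days_old > Y_DAYS

print(permission_to_submit(25, 10))  # True
print(permission_to_submit(25, 1))   # False
```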


This would incentivize some of the insidious viral marketers out there who leverage social news sites: they would create fake personas and eventually masquerade submissions for profit... that is, if YC News faces the problem of large numbers.

It also makes YC News elitist and shuts out new members, which I think would defeat the purpose of such a site. Fresh blood helps the mental juices flow, as do varied ideas.


Good points.

I just assumed that the natural progression was: lurker, voter, commenter, submitter, addict.


You are actually right on the money with this progression. Also, YC News is small enough (thankfully) that we don't have much of a spam problem. I'm just attacking our current situation as a global/general spam problem that social sites have.

Having an opportunity to brainstorm gives me SO much enjoyment!


FWIW, I commented before I voted. Only seemed fair to allow myself to be judged before judging.


You could only be allowed to submit X number of stories a day based on your karma... So as you earn some respect you can do more. If you start with 1 you could only post 1 article a day, which would make it less worthwhile to spammers.

Or add flagging on users and if it passes some threshold it has the admins review the account and take action if it really is spam.

Just two thoughts off the top of my head


Some users would get too powerful in the long run. These users would be the head of a long tail of submissions and could have disproportionate influence.

Of course, this depends upon YC News having the problem of a large user base. Also, I could be totally wrong in this imagination simulation...


How about restricting the number of submissions allowed in a given period, scaled by karma? This would encourage people to only submit meaningful news, and it would also offer proactive protection against spam from low karma users. Just a suggestion.


When I first started blogging, I wanted people to come read my great words of wisdom (naturally. Why am I not a god of publishing yet?!?) So I hit on the idea of tagging my Slashdot posts with a line at the bottom plugging whatever I was blogging about that day.

I made it a point to post interesting relevant comments to Slashdot, but at the bottom I would always ham it up with stuff like "Giant Sponge Moon Found Orbiting Saturn!"

People gave my comments 5 ratings almost all the time, but it really, really pissed off a group of \. readers. Was that spam? I'm still not sure of the answer to that question. I know 1podgirl is a spammer, but there are all kinds of edge cases too.


pg's pretty good about killing spammer accounts as they come... If it becomes a problem, new users could have a probationary period of one approved submission before their submissions show up. And a reddit-style filter on submission volume below a certain karma would prevent a spammer from filling up the new page.


We're only in trouble here if PG and the others used to implicitly moderate traffic are likely to upvote those links. Otherwise, we can safely ignore them. Adding a 'report submission' button might be prudent as it would make the process of eliminating junk links more efficient.


This was probably mentioned in some form before, but what about a combination user spam button/automatic temp-ban? If the last X posts by user were marked as spam, ban account for X days. Two temp bans = perm ban.
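The escalating-ban idea above could be sketched like this. The strike count, the two-strikes-to-permanent rule, and the class are all illustrative:

```python
# Sketch of report-driven escalating bans: if the last X posts by a user
# are marked as spam, temp-ban the account; two temp bans = permanent ban.

SPAM_STRIKES = 3  # illustrative value of X

class Account:
    def __init__(self):
        self.recent_spam_flags = 0
        self.temp_bans = 0
        self.perm_banned = False

    def flag_post_as_spam(self):
        self.recent_spam_flags += 1
        if self.recent_spam_flags >= SPAM_STRIKES:
            self.recent_spam_flags = 0  # strikes reset on each temp ban
            self.temp_bans += 1
            if self.temp_bans >= 2:     # second temp ban becomes permanent
                self.perm_banned = True

acct = Account()
for _ in range(6):  # six spam flags: two full strike cycles
    acct.flag_post_as_spam()
print(acct.perm_banned)  # True
```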


Easiest way is to just ask really simple technical questions in the submission area as a captcha.

Not sure which technical questions would be 100% answerable to everyone around here, but it is Hacker News.


What CAPTCHA emphasizes is automation - not only of the verification process, but also of generating the problems. If the problem space is fixed, a list of answers will eventually be compiled, if it's worth the effort. In practice, if spammers decide it isn't worth it, we can outpace them by adding more problems, but that's passive - it cedes control to them. In general, what matters is the asymptotic ratio of our cost to the spammer's cost: if both are linear, it's not very promising. If the ratio approaches zero asymptotically, then the more we and the spammers compete, the better we do, and the control is ours.


Great idea, and one that would be applicable to CAPTCHAs as a whole if the questions were non-technical. E.g. ask the user 'What color is a US dollar bill?'


I've lived in the US, and I had to look at the dollar bills I happen to have in my wallet. Green, yes, but did they add other colors recently? Not a very foreigner-friendly question.

The problem with technical questions is that no matter how simple they are, people here will argue about the correct answer ;-). Unless we just give up and say that "It depends" is always correct.


Btw, is there human moderation at News YC? If not, maybe provide the ability to the top 10 leaders to delete such spam links.


Yup, I think pg mentioned before that there's a list of trusted users who can edit misleading titles, delete posts, or change URLs. (In the early days, PDF links got changed to Scribd links by moderators.)


See those arrows? That’s human moderation. Just don’t click on the up arrow if you think it’s spam.


Spam blocking is probably a 1 liner in Arc.


"When Arc gets ahold of spam, it turns it into prose worthy of Shakespeare".

"Arc doesn't need to block spam, because spam is too afraid to get anywhere near Arc".


Arc creates classic lit. spam? :-)

http://www.google.com/search?q=classic+lit+spam


I don't know... don't vote them up?


A captcha on the register screen?


Negative karma should lead to a user ban.


Ouch-- I can't wait until the first YC lynch mobs begin.



