
This is the difficulty in fighting spam. Google's goal is to determine which sites are the "best," using the number of natural (organic) links as its scorecard. And as soon as people realized that's what Google was looking at, they started to mimic organic links to boost their own rankings.

In a sense, Google is the largest crowdsourced project of all time. It's a lot like Reddit or Hacker News in that every link is an upvote, but the genius of Google is that each link carries a different weight, and that links are a natural byproduct of using the Internet. In short, the people contributing to the crowdsourced ranking system don't even realize they're doing so most of the time. They're just doing what they like and leaving a byproduct of doing so (links, social signals) that Google can use to tell you which sites people consider valuable.
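The "each link carries a different weight" idea is the core of PageRank: a vote from a highly-ranked page is worth more than one from an obscure page. Here's a toy power-iteration sketch (not Google's actual algorithm, which has long since grown far beyond this):

```python
# Toy power-iteration PageRank: every inbound link is an "upvote,"
# but a vote from a highly-ranked page carries more weight.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its current rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
# "c" collects links from three pages, including well-ranked ones,
# so it ends up with the largest share of rank.
```

This is also exactly why link spam works: anything that manufactures inbound links manufactures rank, which is the whole cat-and-mouse game described above.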

But that means that once people realize what Google is using to rank, they can mimic those signals, and sway the algorithm in their favor. The problem Google is going to run into is that once spammers can closely (and at times programmatically) mimic what is happening "organically," Google's algorithms cannot tell the difference.

Right now Google's approach seems to be ignoring or heavily devaluing portions of the Internet that have been overrun with spam: article submissions, blog comments, and now apparently guest posting, which sucks for people who do really high-quality, organic guest posting; for Google, that has to be collateral damage. Spammers will simply move on to the next portion of the Internet, mimicking whatever Google still uses as a ranking signal. It's an endless battle.

One of the big things I see happening now is entire website hijackings (I've been meaning to email you about that, Matt). I did a quick little report on the search engine results for "Viagra," and 81 of the top 100 are hijacked websites, including a client whose site I have to upgrade to a newer version of Drupal because the older one was compromised. I don't know if or how Google will win this battle, but it's far from over. I honestly feel like the way we gather data to rank websites, and what will be successful in 25 years, will have to be completely unrelated to what Google is doing now, and much harder to manipulate than spreading links all over the Internet.

"One of the big things I see happening now is entire website hijackings"

It's true that website hacking remains a big issue in the spammiest areas of the web even though it's completely illegal. Unfortunately, there's a lot of unmaintained, unpatched web servers out there that blackhats exploit. It's fundamentally a hard problem, but we've been working on the next generation of our hacking detection algorithms.

I've always wondered why we don't see more startups offering hacker protection, detection, clean-up, etc. Companies like McAfee made a lot of money protecting personal computers, and there's a similar opportunity on the web server side.

I think a lot of companies are scared of how good hackers are these days. No one wants to guarantee protection, because a more elite hacker will come along and take that as a challenge by hacking into the sites that company X has protected.

I definitely agree with you that it's somewhat of an undertapped market, but I also think it needs to be headed up by the right individual(s).

It could be scary to try to outwit hackers, but even if you don't tackle the problem head on, a website backup and recovery service could be a big help.

If that's not hint from someone 'in the know' about a startup someone should start, I've never seen one.

Right, but is it really a pain point to the negligent web site owner who doesn't care enough to make sure it's secure - or is it more of a pain point for Google trying to determine what's quality and not?

"there's a lot of unmaintained, unpatched web servers out there that blackhats exploit"

To be sure, there are also many patched, maintained web servers and applications that get exploited (by someone, for some purpose).

The best and the brightest get hacked, and software maintained even by professionals regularly needs to be patched for new exploits. (Take Flash, which seems to run at between one and four updates per month, or even OS X security updates.)

I know everyone seems to think it's the other guy that doesn't have his act together and isn't following the obvious advice but there are many "other guys" that are quite capable and still end up having problems and being exploited. (Source: stuff that I read in news stories the same as everyone else.)

It only applies to WordPress hosting, but that's one of the main reasons we use WP Engine. They actively monitor and protect against hacked WordPress installations.

I know one: http://sucuri.net

Focused exactly on what you mentioned (web site recovery, monitoring and protection).

*I work there :)

"what will be successful in 25 years, will have to be completely unrelated from what Google is doing now"

I know very little about SEO or website design, but isn't Google already kind of tipping its hand about what the future will hold?

Without adblock, a quick Google search shows that the entire top half of the results page is ads. To me, it's fairly obvious the future will bring pay-for-priority search inclusion. You want a top spot? Break out your wallet.

Whether that's a good thing or a bad thing, only time will tell. I'm more curious about search engine competitors 25 years from now. Google's had a great decade, but that can't last forever ... can it?

The top half of the results page will be ads if you searched for something so generic that there essentially are no good organic results. Certainly if you search for something specific, it won't be like that.

For instance, I recently watched a terrible movie, and I'm trying to remember what it's called so I can warn people off of it. Searching for that is a specific piece of information, not a generic "noun"-type thing:


Commercial intent is more important than how broad / generic the term is.

A query like "sports" or "news" may or may not have an ad in the search results. Keywords like "credit cards for people with poor credit scores" will almost always show lots of ads, even though the query is quite specific and contains 8 words.

In some particularly valuable verticals (like hotels) Google then further adds their other paid verticals to the results in addition to the AdWords ads.

One of Matt's videos (not sure which one) mentioned online publishers tend to focus a lot on say 10% to 20% of highly commercial terms, while often paying much less attention to the rest of the searches as the commercial terms typically are far more valuable than the informational ones. (Mesothelioma lawyers can afford to pay more for traffic than say people selling flour, or people offering recipes that contain flour).

Even though fighting spam is a whac-a-mole game, you make it sound harder than it really is.

First of all, it's really hard to replicate high-quality signals. Yes, we've got spammy guest posts and spammy comments and entire websites overrun with spam. But if you were to analyze those links, you'd notice that they are still islands. As in, do you see any Viagra-related links on Hacker News or in your Reddit subscriptions? Do you see Viagra-related links in your Twitter or Facebook stream? What about all your other news channels? The only places I see Viagra-related stuff these days are my Gmail spam folder and porn sites.

And I don't know why, but I'm not seeing much spam in Google's search results while I'm logged in; maybe it's the stuff I search for, or maybe Google learned my interests. In Incognito mode, though, I get a lot more spam.

And second, if there are dark corners of Google's search engine, search keywords that have been overrun with spam, then Google is partly to blame because they've turned a blind eye towards spam for far too long, as they tolerated and still tolerate Adsense spam and content farms.

If Google is indeed facing a spam problem, then there's a whole lot more they could do. Off the top of my head: why not penalize websites hosted on old, insecure versions of WordPress or Drupal? Why not expose a "Report Spam" button to logged-in users? Why did it take so long for Google to detect and ban scraped websites?
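A crawler actually has cheap signals for the "old, insecure CMS" idea. Here's a hypothetical sketch using the `<meta name="generator">` tag that default WordPress installs emit; the `OUTDATED_BEFORE` cutoff is an assumption for illustration, and real detection would need many more signals, since this tag is trivial to remove or fake:

```python
import re

# Assumed cutoff for illustration only: treat anything older than
# this WordPress version as likely unpatched.
OUTDATED_BEFORE = (3, 8)

def wordpress_version(html):
    """Pull the version out of a default WordPress generator meta tag."""
    m = re.search(
        r'<meta\s+name="generator"\s+content="WordPress\s+([\d.]+)"',
        html, re.IGNORECASE)
    return tuple(int(x) for x in m.group(1).split(".")) if m else None

def looks_outdated(html):
    version = wordpress_version(html)
    return version is not None and version < OUTDATED_BEFORE

page = '<head><meta name="generator" content="WordPress 3.5.1" /></head>'
looks_outdated(page)  # flags this install as potentially unpatched
```

The interesting policy question is the one raised above: whether flagging such sites (or demoting them) punishes the negligent owner or just protects searchers.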

By talking about Viagra you are really picking the low-hanging fruit. And yes, I still see plenty of spammy articles on Twitter and Facebook, and Reddit, and even here. People have just gotten better at hiding them, so much so that they even produce articles moderately useful to a lot of people.

Great public relations campaigns make people think they are reading the news without even thinking of public relations. http://www.paulgraham.com/submarine.html

SEO is sort of the same way in terms of there being a range of options. Some people might use SAPE, XRumer, or Fiverr and such (a nude person streaking at a sporting event with a URL), whereas others might use more nuanced strategies where any SEO impact appears incidental.

Design itself can play a big role in the perception of spam. Designs that look crisp can contain nonsensical content without feeling as spammy as they are.

But another factor here is that Google wants to keep raising the bar over time. Things that are white hat fade to gray then black as they become more popular & widespread. And as Google scrapes-n-displaces a wider array of content types, that forces publishers to go deeper (if they can do so without bankrupting themselves).

Google handles content spam much like an email blacklist. What we need is a whitelist alternative, at least for searching spammy subjects like healthcare advice.

How could this work? Well, my company is trying something like that now, though it is early going. The easiest way to do this is by filtering the Web by domains instead of individual web pages. That alone removes several orders of magnitude of complexity.
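A minimal sketch of what domain-level filtering could look like, with an assumed (purely illustrative) whitelist of health domains and a deliberately naive registrable-domain guess:

```python
from urllib.parse import urlparse

# Illustrative whitelist only; a real curated list would be much larger
# and maintained by editors, not hardcoded.
TRUSTED = {"nih.gov", "mayoclinic.org", "who.int"}

def domain_of(url):
    """Naive registrable-domain guess: keep the last two host labels.
    (A real system would consult the Public Suffix List instead.)"""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def allowed(url, whitelist=TRUSTED):
    return domain_of(url) in whitelist

results = [
    "https://www.nih.gov/health-information",
    "http://cheap-pills.example/viagra",
]
clean = [u for u in results if allowed(u)]  # keeps only the nih.gov result
```

The appeal is exactly the complexity reduction mentioned above: you judge thousands of domains instead of billions of pages, at the cost of the domain-level caveats raised in the reply below this comment.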

A couple issues with this:

many topics tend to bleed across niches

sites themselves will change purposes over time

how sites are monetized (and other user experience choices) may change over time

people buy and sell sites

sometimes the desired information is only accessible on sites which are not particularly popular because they are not aligned with moneyed interests

sometimes sites that are popular might be popular because they are inaccurate, conforming to an expected + desired bias

if the whitelist is black and white, entities may change their approach after being well trusted (one of the elegant aspects of Panda is how it can remeasure over time)

What we need is a whitelist alternative, at least for searching spammy subjects like healthcare advice. How could this work?

One idea would be to have sites that concentrate on some topic(s) of interest, where people can submit interesting links they find directly, and where others who share those common interests can lend their support to promote the best material and share other related links they've found themselves. I know, I know, it's a radical concept, but I truly believe there's some potential there...

And then you could annotate each website with a tag about what topic it is to help with search.

Then you could maybe even organize them publicly based on those tags. You would need a hierarchy of tags then, though, but that would give you a rather comprehensive, yet still curated, view of the web...

