
My advice would be to use robots.txt to block out the autogenerated pages; when users search for e.g. a long-tail phrase and then land on a page that's nothing but affiliate links with a lot of keywords, they tend to complain to us.

Users would be happier if they landed on the root page of your site or the root page of http://www.superfillers.com/ or http://www.filleritemfinder.com/ than if they landed on a deep page full of links and unrelated products.




Ok, well I am willing to give that a try. Just implemented. Thanks Matt!

http://filleritem.com/robots.txt


Cool, I'll send in a reconsideration request. You might want to add http://filleritem.com/iframe.html?q= as a pattern too.


Rather than "banning" an entire site that seems to have some spammy pages, why not index only the home page, and flag it in the Google Webmaster Tools dashboard?

Also: are any specific notices or warnings given in Google Webmaster Tools, to inform the webmaster and give some clues about the findings and decisions of your team?

Thanks, Matt, for taking the time to explain things here.


Google needs to be more open and helpful for small businesses. A local small business with a new website has huge challenges to overcome.


Your comment on my response is nested too deep for me to reply to, so I will reply again here.

Thanks for submitting the reconsideration request. I just added Disallow: /iframe.html?q= as suggested.
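
For anyone who wants to sanity-check a pattern like that, here is a quick test with Python's standard-library robots.txt parser. The rules below just mirror what's been suggested in this thread rather than the live file, and note that Googlebot's own matching supports wildcards that this parser doesn't:

    # Quick check of the Disallow pattern suggested in this thread, using the
    # standard library's robots.txt parser. Rules here are illustrative; the
    # live robots.txt may differ.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /iframe.html?q=",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # Autogenerated result pages should be blocked...
    print(rp.can_fetch("Googlebot", "http://filleritem.com/iframe.html?q=usb+cable"))  # False
    # ...while the home page stays crawlable.
    print(rp.can_fetch("Googlebot", "http://filleritem.com/"))  # True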


http://www.filleritemfinder.com/ is the first hit for "filler item" on google right now.

A search there for "hacker news" results in a page full of affiliate links, just like the example you gave above. The only difference is that they didn't re-write the URL.

filleritemfinder.com has no robots.txt that I was able to pull up.

So, filleritem.com, a google customer, was blocked, but filleritemfinder.com, doing the same thing, is the number one result.

Further, shouldn't this kind of advice be given to people who appeal being excluded from the index? Or should we all post to Hacker News when it happens to us so that you can come explain directly?

I think 50% of the problem is the arbitrary picking of sites to block (and it's not working, btw[1]) and 50% of it is that Google seems uninterested in explaining or advising people when it happens to them.

[1] I've been buying gear for a project lately, and so doing a lot of Google searches in the form of "product-model-number review" or "product-name review". Overwhelmed with spam sites, and mindless human-generated spam sites like dpreview.com, etc.


When I do the search [filler item], one of the top results I see is http://www.makeuseof.com/tag/top-5-amazon-filler-item-finder... which shows five different services to fill this information need, and that also has pictures. I do think that's a pretty helpful result.

I mentioned filleritemfinder.com as a random example (there are many of these services), but filleritemfinder.com appears to use AJAX to keep results on the same page rather than making a new url.

"filleritem.com, a google customer, was blocked, but filleritemfinder.com, doing the same thing, is the number one result."

The filleritemfinder.com site is not doing the same thing, because it's not generating fresh urls for every possible search. But you're not really suggesting that we should treat advertising customers somehow differently in our search results, are you? The webspam team takes action regardless of whether someone is an advertising customer or not.

"shouldn't this kind of advice be given to people who appeal being excluded from the index?"

This advice is available to everyone in our quality guidelines. It sounds like the site owner reached out to the AdWords team, which gave him clear guidance that the site violated the ads policy on bridge pages. It sounds like the site owner also filed a reconsideration request, and we replied to let the site owner know that the reconsideration request was rejected because it was still in violation of our policies. It doesn't look like the site owner stopped by our webmaster support forum, at least that I could see in a quick look. At that point, the site owner did a blog post and submitted the post to Hacker News, where I was happy to reply.


If the site were to use another monetization model, e.g. advertising instead of affiliate links, would it be less likely to be penalized?


I recently talked for about a minute about the topic of "too much advertising" that sometimes drowns out the content of a page. It was in this question and answer session that we streamed live on YouTube: http://www.youtube.com/watch?v=R7Yv6DzHBvE#t=19m25s . The short answer is that we have been looking more at this. We care when a page creates a low-quality or frustrating user experience, regardless of whether it's because of affiliate links, advertising, or something else.


Thanks Matt, I took a look at the video and that answers my original question.

Also, is there a preferred monetization model? E.g. does Google think advertisements are more or less harmful to the user experience than affiliate links, sponsored posts, etc.?

Obviously you can't just track the space taken up across different models, so is there some kind of metric that tracks the rate of content dilution via monetization?


I know you were looking for an answer from Matt but I thought I would offer up my opinion here (as you might have noticed, I love talking about this stuff).

Our models currently suggest that the presence of contextual advertising is a significant predictive factor of webspam.

We use 10-fold bagging and classification trees, so it's not all that easy to generalize. But I pulled one model out at random for fun.

The top predictive factor in this particular model is the probability outcome of the bigrams (word pairs) extracted from the visible text on the page. Here are a few significant bigrams:

relocationcompanies productsproviding productspure qualitybook recruitmentwebsite ticketsour thesetraffic representingclients todayplay tourshigh registryrepair rentproperties weddingportal printingcanvas prhuman privacyprotection providingefficient waytrade printingstationery priceseverything website*daily

Next, this model looks for tokens extracted from the URL and particular meta tags from the page. Similar to above, but I believe unigrams only. A few examples follow. Please keep in mind that none of these phrases are used individually... they are each weighted and combined with all other known factors on the page:

offer review book more Management into Web Library blog Joomla forums

The model then looks at the outdegree of the page (number of unique domains pointed to).

From there, it breaks down into TLD (.biz, .ru, .gov, etc)

The file gets pretty hard to decipher at this point (it's a huge XML file) but contextual advertising is used as a predictive variable throughout.

Just from eyeballing it, it appears to be more or less as significant as the precision and recall rate of high value commercial terms, average word length (western languages only), and visible text length.

Based on what I'm looking at right now, my answer would be that sponsored posts are going to be far more harmful to the user experience than advertising.

Can't answer the rest of your question which I assume relates to the number of ad blocks or amount of space taken up by ads... we don't measure it.
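
If it helps to make that concrete, here is a toy sketch of the general shape of such a pipeline: bagged classification trees over visible-text bigrams plus a few page-level signals (outdegree, TLD, a contextual-ads flag). This is just my illustration, with made-up pages and labels and scikit-learn as the library, not an excerpt of the actual model described above:

    # Toy illustration only: bagged decision trees over text bigrams plus a few
    # page-level signals, roughly mirroring the feature types described above.
    # The pages, labels, and library choice (scikit-learn) are all made up here.
    from scipy.sparse import csr_matrix, hstack
    from sklearn.ensemble import BaggingClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # (visible text, outdegree, TLD, has contextual ads, is spam)
    pages = [
        ("buy cheap tickets our relocation companies providing efficient trade", 45, ".biz", 1, 1),
        ("weekly photography notes and a long hands-on product review",           3, ".com", 0, 0),
        ("registry repair rent properties wedding portal privacy protection",    60, ".ru",  1, 1),
        ("open source library documentation and community forum archive",         8, ".org", 0, 0),
    ]
    texts  = [p[0] for p in pages]
    labels = [p[4] for p in pages]

    # Bigram (word-pair) counts from the visible text.
    bigram_vec = CountVectorizer(ngram_range=(2, 2))
    X_text = bigram_vec.fit_transform(texts)

    # Page-level signals: outdegree, one-hot TLD, contextual-ads flag.
    tlds = sorted({p[2] for p in pages})
    X_page = csr_matrix(
        [[p[1]] + [1 if p[2] == t else 0 for t in tlds] + [p[3]] for p in pages],
        dtype=float,
    )

    X = hstack([X_text, X_page]).tocsr()

    # Ten bagged classification trees, as in the comment above.
    model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
    model.fit(X, labels)

    print(model.predict(X[:1]))  # e.g. [1] -> classified as spam

In a real system the weights come from the training data, which is why individual bigrams like the ones above only matter in combination with all the other factors on the page.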

Edit: Just realized that Google will probably delist this page within 24 hours. Should've used a gif for those bigrams. Oh well ;-)


Thanks for your response. Is the data you are viewing publicly available?


No...


Sadly, quite true.

"True" - because my current understanding (which Matt_Cutts can elucidate on if he chooses to) is that Google has looked into - but does not currently incorporate - the presence of advertising as a spam signal.

"Sadly" - because my independent research has shown that advertising - most notably the presence of Google AdSense - is a reliable predictive variable of a page being spam.

All things being equal, a page with AdSense blocks on it is far more likely to be spam. Yet as of a few months ago, that does not appear to weigh very heavily into the equation.


I agree, but don't you think that from an algorithmic point of view Google would be better off looking at what the user wants and what monetization models they prefer to see versus the averages in terms of monetization models on spam sites?

That way they're focusing less on removing spammers and more on user quality, and thus removing spam.


This really made me think. But I took exception to your comment:

I agree, but don't you think that from an algorithmic point of view Google would be better off looking at what the user wants and what monetization models they prefer[...]

No. In fact, I am a rather loudmouthed opponent of Google's somewhat clumsy attempts to measure this à la "Quality Score".

In addition to webspam detection and machine learning, I have spent way too much time in marketing (I have a master's degree in marketing, in fact.)

A neat thing I learned along the way was the value of market research.

There are so many nuances in every line of business. Segments, preferences, pricing, even down to minutia (now well studied) such as fonts, gutter widths, copy styles, and so on.

You can learn a lot by combining large amounts of data and well-chosen machine learning algos. But even a few thousand businesses in most categories in a particular country (far fewer outside the US) doesn't give an outsider enough data to truly distinguish a winning formula from a spammy one. This knowledge is hard won through carefully executed experiments and research.

A few years ago I was researching the topic of landing page formulas by category. One example that stuck out most in my mind was mortgages. There were a few tried and true "formulas" that significantly outperformed the rest. Two stuck out:

1) Man, woman, and sometimes child standing on a green lawn in front of home. Arrow pointing down from top left of landing page to mid/lower right positioned form. Form limited to three fields.

Edit: http://imgur.com/90VmB

2) Picture of home/s docked to bottom of lead gen page. No people. Light/white background. Arrow pointing down from top left of landing page to centrally located form.

Edit: http://imgur.com/JkLlH

These sites were incredibly successful. More than a few of them had to contend with quality score issues over the years. Can an algorithm capture nuances such as the ones I mentioned? In theory... they could. But today, they don't. All of Google's QS algorithms to date have been failed attempts and have caused an incredible amount of harm and distrust.

You finished that sentence with:

to see versus the averages in terms of monetization models on spam sites.

I'm not at all sure what this means. Could you explain? Is it even possible to directly model the monetization model of a site without having direct access to their metrics?


Well, we know Google can classify which sites are spam.

And assuming they could (despite your arguments against) determine the method of monetization on a site, they could then simply compare the models they see on spam sites against another metric that tracks users' reactions to certain types of monetization models.


And assuming they could (despite your arguments against) determine the method of monetization on a site

I can assure you that they cannot do this accurately. You would be amazed at the scummy business models that openly advertise on Google and are not caught. It would be incredibly difficult to do so, as some of them are downright ingenious (one example: free software that updates your drivers).

It sounds to me like you are positing some type of magical technology that doesn't exist, predicated on Google's seeming omniscience. Of course, I am eager to stand corrected...?


I'm not declaring it's a technology they are capable of developing, just wondering. I'm still a bit surprised that, with all their top engineers, they couldn't (if they wanted to) come up with a solution to this.


I think 50% of the problem is the arbitrary picking of sites to block (and it's not working, btw[1]) and 50% of it is that Google seems uninterested in explaining or advising people when it happens to them.

This. Assume you work on the webspam team and you have a 92% spam detection rate but a 99.9% accuracy rate on what you do detect.

There are around 40,000,000 active domains listed on Google in a given month. That means roughly 40,000 sites on average are being penalized without reason.
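
Spelled out, with the hypothetical numbers above and the assumption that the 0.1% error rate applies across every listed domain:

    # Hypothetical numbers from the comment above; assumes the 0.1% error rate
    # applies across every active listed domain, not just the flagged set.
    active_domains = 40_000_000
    accuracy_on_detections = 0.999

    print(round(active_domains * (1 - accuracy_on_detections)))  # 40000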



